There’s No Need for Hadoop: Analyze Server Logs with AnalyticDB

By Ji Jiayu, nicknamed Qingyu at Alibaba.

Analyzing server logs is one of the first and most crucial steps for any company interested in building its own big data analytics solution. However, before such a solution can be created, a company will typically have to overcome one or more of the following obstacles:

  • A lack of engineers familiar with big data
  • Uncertainty around how to set up Hadoop
  • Difficulty recruiting big data specialists
  • A limited budget for the big data solution

In this article, we’re going to look at how you can overcome the second obstacle by implementing a massively parallel processing database solution with Alibaba Cloud AnalyticDB. With this solution, your company will no longer need to confront the uncertainty of how to set up Hadoop. Moreover, with the support of Alibaba Cloud, you can ultimately also overcome the other major challenges listed above.

But, with all that said, you may still be curious: why exactly can this solution replace Hadoop, and why should you consider using it? Well, to answer this question, let’s explore why exactly Hadoop exists, what it’s used for, and how we can replicate it with Alibaba Cloud’s solution.

Ultimately, Hadoop exists because it helps solve the scaling problems of traditional databases. The performance of a traditional single-node relational database can only be improved by scaling up, which means adding CPU or memory, or swapping in faster hard disks. In the later phases of development, a 5% improvement in computing capability can cost 10 times more than it would in the early phases. Hadoop can help reduce these costs if it is implemented when you first create your databases. Hadoop works as a distributed solution: it scales out using a shared-nothing (SN) architecture, in which performance increases linearly as nodes of equal capability are added. Critically, this means that what you put in is proportional to what you get out.

However, relational databases are not well suited to unstructured or semi-structured data. They store data in rows and columns, so data cannot be inserted in JSON or XML format. Hadoop, by contrast, can interpret these formats by compiling specific input readers, which allows it to process both structured and unstructured data and makes it a highly convenient solution. That said, even if you do choose to use Hadoop in your systems, it’s recommended that you work with structured rather than unstructured data for better processing performance.

Getting back to our solution, all of this tells us that, if the imported data is structured in advance, a relational database with a shared-nothing architecture, that is, a massively parallel processing database, can indeed serve as the basis for a big data analytics solution, as I have proposed.

Alibaba Cloud AnalyticDB is a database management system that is well suited for this type of scenario. By implementing this solution with Alibaba Cloud, you will not only bypass the hassle involved in setting up Hadoop, but you’ll also have the chance to take advantage of the several other solutions Alibaba Cloud has to offer in its large portfolio of big data and analytics products and services.

Requirements

To follow this tutorial, you will need the following, all of which appear in the procedure below:

  • A server running CentOS (Nginx is installed with yum)
  • An Alibaba Cloud DataHub project and an AnalyticDB cluster
  • A RAM user whose AccessKey ID and AccessKey secret can be used by Logstash, DataHub, and AnalyticDB

Procedure

Installing Nginx

Run the following command to install Nginx:

yum install nginx

After Nginx is installed, the information shown below will appear.

Installed:
nginx.x86_64 1:1.12.2-3.el7
Installed as dependencies:
nginx-all-modules.noarch 1:1.12.2-3.el7 nginx-mod-http-geoip.x86_64 1:1.12.2-3.el7 nginx-mod-http-image-filter.x86_64 1:1.12.2-3.el7 nginx-mod-http-perl.x86_64 1:1.12.2-3.el7
nginx-mod-http-xslt-filter.x86_64 1:1.12.2-3.el7 nginx-mod-mail.x86_64 1:1.12.2-3.el7 nginx-mod-stream.x86_64 1:1.12.2-3.el7
Complete!

Now, you’ll need to define the log format of Nginx. In this example, we want to collect statistics on unique visitors (UV) to a website and the response time of each request, so we need to add $request_time and $http_cookie to the log format. The default date format of Nginx is 07/Jul/2019:03:02:59, which is not well suited to the queries we will use in the subsequent analysis, so we change the date format to 2019-07-11T16:32:09.

Next, you’ll want to find /etc/nginx/nginx.conf, open the file with Vim, and change the log_format value as follows:

log_format  main  '$remote_addr - [$time_iso8601] "$request" '
                  '$status $body_bytes_sent $request_time "$http_referer" '
                  '"$http_user_agent" "$http_x_forwarded_for" "$http_cookie"';

Then, run the following command to restart Nginx so that the configuration takes effect:

service nginx restart

Check the logs in /var/log/nginx/access.log. You will find that the log format is as follows:

119.35.6.17 - [2019-07-14T16:39:17+08:00] "GET / HTTP/1.1" 304 0 0.000 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36" "-" "-"

Note that the cookie field is empty here because, by default, the Nginx home page does not set any cookies. To collect meaningful UV statistics, your site needs to set cookies, for example through a logon system.

Deploying DataHub

Next, in the DataHub console, you’ll need to access the log_demo project (creating it first if it does not exist yet) and create a topic named topic_nginx_log. Your screen should look like the one shown below:

Now set Topic Type to TUPLE, which means the topic is organized by a schema; this makes subsequent viewing and processing of the records easier. The definition of the schema is strongly correlated with the fields that are extracted from the logs, so you'll need to ensure that the schema is free of errors when you create it. Otherwise, errors may occur when the logs are recorded.

Now, consider the following as a point of reference. Below is a list of field names, along with their descriptions.
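These field names match what the grok pattern (configured later in this tutorial) extracts from each log line; the field types shown here are an assumption chosen to mirror the column types of the AnalyticDB table created further below:

remote_ip (STRING)      Client IP address
date (BIGINT)           Request date with the dashes removed, for example 20190714
time (STRING)           Request time, for example 16:39:17
method (STRING)         HTTP method, such as GET or POST
request (STRING)        Requested path
http_version (STRING)   HTTP protocol version
status (STRING)         HTTP response status code
bytes (BIGINT)          Size of the response body in bytes
request_time (DOUBLE)   Time taken to process the request, in seconds
referer (STRING)        HTTP referer
agent (STRING)          User agent
xforward (STRING)       X-Forwarded-For header
cookie (STRING)         Cookie, used for UV statistics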

Note also that, after the topic is created, the Schema tab of this topic should look something like this:

Installing Logstash

Now, to install Logstash, you’ll want to return to the console and run the following commands to download and decompress Logstash to the destination directory:

## Note: the version number and download link below may have changed due to updates; it is recommended to get the latest link from the product introduction page.
# The -O option makes sure the downloaded file is named logstash-with-datahub-6.4.0.tar.gz, as expected by the tar command that follows.
wget "http://aliyun-datahub.oss-cn-hangzhou.aliyuncs.com/tools/logstash-with-datahub-6.4.0.tar.gz?spm=a2c4g.11186623.2.17.60f73452tHbnZQ&file=logstash-with-datahub-6.4.0.tar.gz" -O logstash-with-datahub-6.4.0.tar.gz
tar -zxvf logstash-with-datahub-6.4.0.tar.gz

Then, after Logstash is installed, you’ll need to complete two tasks. First, the captured log lines must be processed into structured data, which Logstash’s grok filter can easily do. Second, Logstash must be told where to capture log files from, which is the log storage location of Nginx, and where to synchronize them to, namely DataHub.

Configuring grok

The grok patterns file is located at the following path (the exact directory name depends on the version of the bundled plugin):

logstash/vendor/bundle/jruby/1.9/gems/logstash-patterns-core-xxx/patterns/grok-patterns

Use Vim to open this file and add the new patterns to the end of the file as follows:

DATE_CHS %{YEAR}\-%{MONTHNUM}\-%{MONTHDAY}
NGINXACCESS %{IP:remote_ip} \- \[%{DATE_CHS:date}T%{TIME:time}\+08:00\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:bytes} %{NUMBER:request_time} %{QS:referer} %{QS:agent} %{QS:xforward} %{QS:cookie}
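To see how this pattern maps onto real traffic, here is how the sample log line shown earlier breaks down into fields (the user agent is shortened for readability):

119.35.6.17 - [2019-07-14T16:39:17+08:00] "GET / HTTP/1.1" 304 0 0.000 "-" "Mozilla/5.0 (...)" "-" "-"

remote_ip    = 119.35.6.17
date         = 2019-07-14   (the dashes are removed later by the Logstash mutate filter)
time         = 16:39:17
method       = GET
request      = /
http_version = 1.1
status       = 304
bytes        = 0
request_time = 0.000
referer      = "-"
agent        = "Mozilla/5.0 (...)"
xforward     = "-"
cookie       = "-"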

Configuring Logstash

Next, put the following pipeline configuration into config/logstash-sample.conf, the file that will be referenced when starting Logstash below:

input {
  file {
    path => "/var/log/nginx/access.log"   # Location of the Nginx logs; this is the default path.
    start_position => "beginning"         # Where to start reading the log file; "beginning" means from the very start.
  }
}
filter {
  grok {
    match => {"message" => "%{NGINXACCESS}"}   # How grok converts each captured message into structured fields
  }
  mutate {
    gsub => [ "date", "-", ""]   # The date is later used as the subpartition of the ADB table, so non-numeric characters must be removed
  }
}
output {
  datahub {
    access_id => "<access-key>"                                # AccessKey ID of the RAM user
    access_key => "<access-secret>"                            # AccessKey secret of the RAM user
    endpoint => "http://dh-cn-shenzhen-int-vpc.aliyuncs.com"   # DataHub endpoint, which depends on the region where the project was created
    project_name => "log_demo"                                 # Name of the project created in DataHub earlier
    topic_name => "topic_nginx_log"                            # Name of the topic created in DataHub earlier
    dirty_data_continue => true                                # Whether to keep running when dirty data is encountered
    dirty_data_file => "/root/dirty_file"                      # File to which dirty data is written
    dirty_data_file_max_size => 1000                           # Maximum size of the dirty data file
  }
}
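Before starting the pipeline, you can optionally ask Logstash to validate this configuration file and exit, using its standard config-test flag:

bin/logstash -f config/logstash-sample.conf --config.test_and_exit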

Then, you can start Logstash to verify whether log files are written into DataHub. In the root directory of Logstash, run the following command:

bin/logstash -f config/logstash-sample.conf

The following will show up when Logstash has been started.

Sending Logstash logs to /root/logstash-with-datahub-6.4.0/logs which is now configured via log4j2.properties
[2019-07-12T13:36:49,946][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2019-07-12T13:36:51,000][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.4.0"}
[2019-07-12T13:36:54,756][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>2, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2019-07-12T13:36:55,556][INFO ][logstash.outputs.datahub ] Init datahub success!
[2019-07-12T13:36:56,663][INFO ][logstash.inputs.file ] No sincedb_path set, generating one based on the "path" setting {:sincedb_path=>"/root/logstash-with-datahub-6.4.0/data/plugins/inputs/file/.sincedb_d883144359d3b4f516b37dba51fab2a2", :path=>["/var/log/nginx/access.log"]}
[2019-07-12T13:36:56,753][INFO ][logstash.pipeline ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x445e23fb run>"}
[2019-07-12T13:36:56,909][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2019-07-12T13:36:56,969][INFO ][filewatch.observingtail ] START, creating Discoverer, Watch with file and sincedb collections
[2019-07-12T13:36:57,650][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}

Now, you can refresh the home page of Nginx. By default, Nginx comes with a static webpage, which you can open by accessing port 80. When a log entry is generated, Logstash will immediately display the log information. An example is shown below.

[2019-07-12T13:36:58,740][INFO ][logstash.outputs.datahub ] [2010]Put data to datahub success, total 10

The sample data in DataHub is shown below.

Configuring Your AnalyticDB Database

To get started, log on to the AnalyticDB console and create a cluster named nginx_logging. An AnalyticDB database with the same name will be created along with it. After your AnalyticDB database is created, the cluster name, rather than the database name, is shown in the console.

Note that it takes some time for a new cluster to be initialized, so you may have to wait several minutes. After initialization is complete, you can click Log On to Database in the console to access Data Management (DMS) and create a table for storing the log data.

Before we proceed any further, let’s cover two of the core concepts involved in Alibaba Cloud AnalyticDB: partitions and subpartitions.

What Are Partitions and How Should They Be Configured

When it comes to configuring the partition column, it’s not recommended to use a column with skewed data, such as a date or gender column, nor a column that contains many null values. In the example given in this tutorial, the remote_ip column is the most suitable column to serve as the partition, as shown in the snippet below.
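In the CREATE TABLE statement used later in this tutorial, this choice appears as the following partition clause, which hashes remote_ip across 128 partitions:

PARTITION BY HASH KEY (remote_ip) PARTITION NUM 128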

What Are Subpartitions and How Should They Be Configured

In the example shown below, the date column is used as the subpartition, which makes it easy to set the data retention period: if you want to retain the log data of the past 30 days, you simply need to set the number of subpartitions to 30. Note that the data in a subpartition column must be of an integer type, such as bigint or long, so not every column can serve as a subpartition.
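In the CREATE TABLE statement below, this corresponds to the following subpartition clause, where 30 available partitions mean that 30 days of data are retained:

SUBPARTITION BY LIST KEY (date)
SUBPARTITION OPTIONS (available_partition_num = 30)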

Finishing the Configurations

In DMS, run the following statement to create the table for the log data:

CREATE TABLE nginx_logging.nginx(
  remote_ip varchar NOT NULL COMMENT '',
  date bigint NOT NULL COMMENT '',
  time varchar NOT NULL COMMENT '',
  method varchar NOT NULL COMMENT '',
  request varchar COMMENT '',
  http_version varchar COMMENT '',
  status varchar COMMENT '',
  bytes bigint COMMENT '',
  request_time double COMMENT '',
  referer varchar COMMENT '',
  agent varchar COMMENT '',
  xforward varchar COMMENT '',
  cookie varchar COMMENT '',
  PRIMARY KEY (remote_ip,date,time)
)
PARTITION BY HASH KEY (remote_ip) PARTITION NUM 128   -- remote_ip serves as the level-1 partition
SUBPARTITION BY LIST KEY (date)                       -- date serves as the level-2 (sub)partition
SUBPARTITION OPTIONS (available_partition_num = 30)   -- 30 subpartitions, meaning 30 days of data are retained
TABLEGROUP logs
OPTIONS (UPDATETYPE='realtime')                       -- The table is updated by DataHub, so it must be a realtime table
COMMENT ''

Note that, by default, AnalyticDB is accessed through a public or classic network connection. If you’d like a private connection over a virtual private cloud (VPC), you’ll need to configure the IP address of your VPC so that AnalyticDB can work with it. Alternatively, because AnalyticDB is fully compatible with the MySQL protocol, you can also use a public IP address to log on to AnalyticDB from a local server.
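For instance, a standard MySQL client can connect along the following lines; the host and port are placeholders for the connection information shown in your AnalyticDB console, and the credentials are the same RAM user AccessKey pair used in the DataConnector configuration below (you are prompted for the AccessKey secret as the password):

mysql -h <adb-connection-url> -P <port> -u <access-key> -p nginx_logging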

In this tutorial, DataHub accesses AnalyticDB through a classic network connection.

Configuring DataConnector

In the DataHub console, create a DataConnector for topic_nginx_log that synchronizes records to AnalyticDB, and configure it as follows:

Host: # IP address of the classic network that connects to the ADB cluster
Port: # Port number of the classic network that connects to the ADB cluster
Database: nginx_logging # Name entered when the ADB cluster was created, which is also the name of the ADB cluster
Table: nginx # Name of the table that was created
Username: <access-key> # AccessKey ID of the RAM user
Password: <access-secret> # AccessKey secret of the RAM user
Mode: replace into # In the case of a primary key conflict, the existing record is overwritten.

After this configuration is complete, you’ll be able to find the connector listed under DataConnectors. Once data records are available in DataHub, they can also be found by querying the nginx table in AnalyticDB.
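At this point, the data can be analyzed with ordinary SQL. As a rough sketch, the following queries address the two goals set at the beginning of this tutorial, UV counts and response times; they assume only the nginx table created above:

-- Unique visitors per day, counted by distinct cookie
SELECT date, COUNT(DISTINCT cookie) AS uv
FROM nginx_logging.nginx
GROUP BY date;

-- Average and maximum response time for each requested path
SELECT request, AVG(request_time) AS avg_time, MAX(request_time) AS max_time
FROM nginx_logging.nginx
GROUP BY request
ORDER BY avg_time DESC;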

Conclusion

In this tutorial, we collected Nginx access logs with Logstash, synchronized them to AnalyticDB through DataHub, and made them available for SQL analysis, all without setting up a Hadoop cluster. With this massively parallel processing approach, you can start analyzing server logs right away while avoiding the obstacles listed at the beginning of this article.

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.
