There’s No Need for Hadoop: Analyze Server Logs with AnalyticDB

By Ji Jiayu, nicknamed Qingyu at Alibaba.

Analyzing server logs is one of the first and most crucial steps for any company interested in creating its own big data analytics solution. However, before such a solution can be created, a company will typically have to overcome one or more of the following obstacles:

  • A lack of engineers familiar with big data
  • Uncertainty around how to set up Hadoop
  • Difficulty recruiting big data specialists
  • A limited budget for the big data solution

In this article, we’re going to look at how you can overcome the second obstacle by implementing a massively parallel processing database solution with Alibaba Cloud AnalyticDB. With this solution, your company will no longer need to confront the uncertainty of how to set up Hadoop. Moreover, with the support of Alibaba Cloud, you can ultimately overcome the other major challenges listed above as well.

But, with all that said, you may still be curious. Why exactly can this solution replace Hadoop, and why should I consider using it? Well, to answer this question, let’s explore why exactly Hadoop exists, what it’s used for, and how we can replicate it with Alibaba Cloud’s solution.

Ultimately, Hadoop exists to solve the scaling problems of traditional databases. The performance of a traditional single-node relational database can only be improved by scaling up, that is, by upgrading the CPU, adding memory, or swapping in faster hard disks. In the later phases of development, a 5% improvement in computing capability can cost 10 times more than it would in the early phases. Hadoop can help contain these costs if it is adopted when you first design your data architecture. Hadoop is a distributed solution: it scales out using a shared-nothing (SN) architecture, in which performance increases linearly as nodes of equal capability are added, which critically means that the output will be proportional to the input.

However, relational databases are not well suited to unstructured or semi-structured data. They store data in rows and columns, so data cannot be inserted directly in JSON or XML format. Hadoop, by contrast, can interpret these formats by compiling specific input readers, which allows it to process both structured and unstructured data and makes it a highly convenient solution. That said, even if you do use Hadoop, it’s recommended that you stick to structured data for better processing performance.

Getting back to our solution, all of this tells us that if a relational database with a shared-nothing architecture is used to structure the imported data in advance, then big data analytics can indeed be performed in a massively parallel processing database, as I have proposed.

Alibaba Cloud AnalyticDB is a database management system that is well suited for this type of scenario. By implementing this solution with Alibaba Cloud, you will not only bypass the hassle involved in setting up Hadoop, but you’ll also have the chance to take advantage of the several other solutions Alibaba Cloud has to offer in its large portfolio of big data and analytics products and services.

Requirements

Before we begin with the tutorial outlined in this article, let’s first look at some of the major components required in our solution. Note that you’ll also need a valid Alibaba Cloud account to complete the steps described in this tutorial.

  • An ECS instance running CentOS, which hosts the web server
  • Nginx, to serve pages and produce access logs
  • Logstash, to parse the logs and ship them to Datahub
  • Datahub, to buffer the log records and deliver them downstream
  • AnalyticDB, to store and analyze the structured log data

Procedure

In this section, I’ll go through the steps needed to set up our solution using Alibaba Cloud AnalyticDB.

Installing Nginx

First, you’ll want to install Nginx. In CentOS, run the following command:
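The exact command from the original isn’t shown; on CentOS 7, for example, Nginx is available from the EPEL repository:

    # Enable the EPEL repository, then install Nginx (CentOS 7)
    yum install -y epel-release
    yum install -y nginx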

After Nginx is installed, a confirmation message will appear in the terminal.

Now, you’ll need to define the log format of Nginx. In this example, we want to collect statistics on unique visitors (UV) to a website and the response time of each request, so we add $request_time and $http_cookie to the format. The default date format of Nginx is 07/Jul/2019:03:02:59, which is not well suited to the queries we will use in subsequent analysis, so we change the date format to 2019-07-11T16:32:09.

Next, you’ll want to find /etc/nginx/nginx.conf, open the file with Vim, and change the log_format value as follows:
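The original directive isn’t reproduced here; below is a minimal sketch using the variables discussed above ($time_iso8601 produces the ISO-style timestamp; $request_time and $http_cookie capture response time and cookies):

    http {
        # Sketch: a log_format covering the fields discussed above
        log_format main '$remote_addr - $remote_user [$time_iso8601] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" $request_time "$http_cookie"';

        # Write access logs using this format
        access_log /var/log/nginx/access.log main;

        # ... rest of the http block unchanged ...
    }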

Then, run the following command to restart Nginx so that the configuration takes effect:
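On a systemd-based CentOS install, that would be:

    systemctl restart nginx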

Check the logs in /var/log/nginx/access.log. You will find that the log format is as follows:
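Your values will differ, but with the log_format sketched above, each entry would look something like this hypothetical line:

    192.168.1.100 - - [2019-07-11T16:32:09+08:00] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0" 0.003 "uid=abc123"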

By default, the home page of Nginx does not set cookies, so if you want to collect cookie-based UV statistics, you’ll need a page that sets them, for example by implementing a logon system.

Deploying Datahub

The address of the Datahub console is https://datahub.console.aliyun.com/datahub. To deploy Datahub, the first thing you'll want to do is create a project named log_demo.

Next, you’ll need to access the log_demo project and create a topic named topic_nginx_log.

Now set Topic Type to TUPLE, which organizes records with a schema to facilitate subsequent viewing and processing. The definition of the schema is strongly correlated with the fields obtained from the logs, so you'll need to ensure that the schema is free of errors when you create it. Otherwise, errors may occur when the logs are recorded.

Now, consider the following as a point of reference: the topic schema needs one field for each value parsed from the log line. In this example, that means fields such as remote_ip, remote_user, date, method, url, status, length, referrer, agent, request_time, and cookie.

Note also that, after the topic is created, you can review these fields on the topic's Schema tab.

Installing Logstash

Before we begin, it’s important to know that the official Logstash distribution does not include an output plugin for Datahub, so you’ll need to download the latest Datahub-compatible Logstash build from the official Datahub website, which saves you the integration work.

Now to install logstash, you’ll want to return to the console and run the following command to download and decompress Logstash to the destination directory:
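The exact download URL and version come from the Datahub documentation; the commands below are illustrative, with placeholders for the real link:

    # Download the Datahub-compatible Logstash build (URL and version are
    # placeholders; copy the current link from the Datahub documentation)
    wget https://<datahub-docs-download-link>/logstash-with-datahub-6.4.0.tar.gz
    tar -zxvf logstash-with-datahub-6.4.0.tar.gz -C /usr/local/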

Then, after Logstash is installed, you’ll need to complete two tasks. First, the raw log lines must be parsed into structured data as they are captured; Logstash’s grok filter can easily handle this. Second, Logstash must be told where to read log files from, which is the log storage location of Nginx, and where to ship them, namely Datahub.

To enable grok to understand and convert the log files, you’ll need to add a new pattern to the grok-patterns file, which is stored inside the Logstash directory you just downloaded. The path is as follows:
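In a typical Logstash distribution the file sits under the vendored patterns directory; the version numbers below will vary with your download:

    <logstash-root>/vendor/bundle/jruby/2.3.0/gems/logstash-patterns-core-4.1.2/patterns/grok-patterns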

Use Vim to open this file and add the new pattern to the end of the file as follows:
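Here is a sketch of a pattern matching the log_format used earlier; the field names are assumptions and must line up with the Datahub topic schema:

    NGINXACCESS %{IPORHOST:remote_ip} - %{DATA:remote_user} \[%{TIMESTAMP_ISO8601:date}\] "%{WORD:method} %{DATA:url} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:length} "%{DATA:referrer}" "%{DATA:agent}" %{NUMBER:request_time} "%{DATA:cookie}"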

In the Logstash directory, the config sub-directory contains the logstash-sample.conf file. You'll want to open the file and modify it as shown below.
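A minimal sketch of the configuration, assuming the option names documented for the Datahub Logstash plugin (verify them against your plugin version) and the NGINXACCESS pattern added above:

    input {
      file {
        # Read the Nginx access log from its default location
        path => "/var/log/nginx/access.log"
        start_position => "beginning"
      }
    }
    filter {
      grok {
        # Parse each raw line into structured fields with the custom pattern
        match => { "message" => "%{NGINXACCESS}" }
      }
    }
    output {
      datahub {
        # Placeholders: use your own credentials and endpoint
        access_id => "<your-accesskey-id>"
        access_key => "<your-accesskey-secret>"
        endpoint => "<your-datahub-endpoint>"
        project_name => "log_demo"
        topic_name => "topic_nginx_log"
        # Keep running when a record fails to parse; log it instead
        dirty_data_continue => true
        dirty_data_file => "/tmp/datahub_dirty.data"
      }
    }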

Then, you can start Logstash to verify whether log files are written into Datahub. In the root directory of Logstash, run the following command:
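Assuming the configuration file above:

    bin/logstash -f config/logstash-sample.conf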

Logstash will print startup messages once it has started successfully.

Now, you can refresh the home page of Nginx. By default, Nginx comes with a static webpage, which you can open by accessing port 80. When a log entry is generated, Logstash will immediately print the parsed log information to its console.

You should then be able to see the sample data in the Datahub console.

Configuring Your AnalyticDB Database

Configuring your AnalyticDB database is relatively simple, although tuning the configuration to best meet your needs requires a bit of knowledge about several related concepts, which I’ll cover in this section.

To get started, log on to the AnalyticDB console and create a cluster named nginx_logging. An AnalyticDB database with the same name will be created automatically. Note that after the database is created, the console displays the cluster name rather than the database name.

Note that it takes some time for a new cluster to be initialized, so you may have to wait several minutes. Once initialization is complete, you can click Log On to Database in the console to access Data Management (DMS) and create the table used to store the log files.

Before we proceed any further, let’s cover two of the core concepts involved in Alibaba Cloud AnalyticDB: partitions and subpartitions.

Understanding partitions is crucial to getting good performance out of a massively parallel processing database like AnalyticDB. Such a database scales linearly as long as the data is distributed evenly, because more hard disks can then be utilized to improve the overall I/O performance of query processing. In AnalyticDB, how evenly the data is distributed depends largely on the column chosen as the partition key.

When configuring the partition, it’s not recommended to use a column with skewed data, such as a date or gender column, nor is it a good idea to use a column containing many null values. In the example given in this tutorial, the remote_ip column is the most suitable partition key.

So what are subpartitions, and how should you configure them? Subpartitions are simply second-level partitions, so using them is much like splitting a table into multiple tables, which can improve the performance of table-wide queries. When configuring them, you set a maximum number of subpartitions; once that number is exceeded, AnalyticDB automatically deletes the oldest subpartition, clearing out historical data and keeping the count within the limit you set.

In the example shown below, the date column is used as the subpartition key, which makes it easy to set the data retention time: if you want to retain the log data of the past 30 days, simply set the number of subpartitions to 30. Note that the data in a subpartition column must be of an integer type, such as bigint or long, so not every column can serve as a subpartition.

The SQL statement for creating an Nginx table is shown below.
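The original statement isn’t reproduced here; the following is a sketch based on AnalyticDB 2.0 DDL, using the field names from the grok pattern and assuming the timestamp has been converted upstream into an integer day (for example 20190711) so that date can serve as the subpartition key:

    CREATE TABLE nginx_log (
      remote_ip    varchar NOT NULL,
      remote_user  varchar,
      date         bigint NOT NULL,  -- integer day, e.g. 20190711
      method       varchar,
      url          varchar,
      status       int,
      length       bigint,
      referrer     varchar,
      agent        varchar,
      request_time double,
      cookie       varchar,
      PRIMARY KEY (remote_ip, date)
    )
    PARTITION BY HASH KEY (remote_ip) PARTITION NUM 128   -- even spread across nodes
    SUBPARTITION BY LIST KEY (date)
    SUBPARTITION OPTIONS (available_partition_num = 30)   -- keep ~30 days of logs
    OPTIONS (UPDATETYPE = 'realtime');

Once data is flowing, a query such as the following (again a sketch) would compute the daily unique visitors and average response time this tutorial set out to measure:

    SELECT date,
           COUNT(DISTINCT cookie) AS uv,
           AVG(request_time)      AS avg_response_time
    FROM nginx_log
    GROUP BY date
    ORDER BY date;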

Note that, by default, AnalyticDB is accessed over a public or classic network connection. If you’d like a private connection over a Virtual Private Cloud (VPC), you’ll need to configure the IP address of your VPC so that AnalyticDB can work with it. Alternatively, since AnalyticDB is fully compatible with the MySQL protocol, you can also use a public IP address to log on to AnalyticDB from a local server.

In this tutorial, Datahub accesses AnalyticDB through a classic network connection.

Configuring DataConnector

Now, return to the topic named topic_nginx_log that we created previously. The +DataConnector button is available in the upper-right corner. Click it and enter the following information related to your AnalyticDB database.
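The exact fields vary with the console version, but they typically include:

  • The AnalyticDB connection address (host and port)
  • The database name (nginx_logging in this example)
  • The destination table name (nginx_log in the sketch above)
  • An AccessKey ID and AccessKey Secret with write access to the database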

After this configuration is complete, you’ll find the connector listed under DataConnectors. Once data records are available in Datahub, they will also appear when you query the Nginx table in AnalyticDB.

Conclusion

If you have followed the steps described in this tutorial, you have created a massively parallel processing database system without having to set up Hadoop. Stay tuned for more helpful tutorials from Alibaba Cloud.
