Alibaba Cloud Logstash JDBC Implements Data Synchronization between Elasticsearch and Relational Databases

By Liu Xiaoguo, an Evangelist of the Elasticsearch Community in China and edited by Zhang Lei and Dayu

Released by ELK Geek

Overview

Alibaba Cloud Logstash, a Java Database Connectivity (JDBC) input plug-in, is the key to synchronizing data between Alibaba Cloud Elasticsearch and ApsaraDB for RDS (RDS) instances. It runs a loop to poll RDS instances regularly to find any records that have been inserted or changed in this loop after the last iteration. The following conditions must be met before running the plug-in properly:

1) While writing RDS documents to Elasticsearch, the “_id” field in Elasticsearch must be set to the value of the “id” field in RDS instances. This establishes a direct mapping between RDS records and Elasticsearch documents. If an RDS record is updated, it is overwritten in Elasticsearch.

Note: In Elasticsearch, document overwriting is as efficient as record updating because record updating involves deleting the old document and indexing the new document.

2) When data is inserted or updated in an RDS instance, a field for the insertion or updating time must be included in the record. This field allows Logstash to request only the documents that were edited or inserted in the loop after the last iteration. The update or insertion time of the last record it reads from the RDS instance is saved upon each polling loop. Upon the next iteration, Logstash only needs to request records that meet the condition — the update or insertion time is later than that of the last record received in the last iteration of the polling loop.

Note: The Logstash JDBC input plug-in cannot implement the “sql delete” operation. You need to manually delete the corresponding data in Elasticsearch. If the preceding condition is met, Logstash is configured to periodically request all new or edited records from the RDS instance and write them into Elasticsearch.

Prerequisites

1) Prepare the Alibaba Cloud Elasticsearch 6.7 environment and use the created account and password to log on to Kibana.

2) Prepare the Alibaba Cloud Logstash 6.7 environment.

3) Activate the RDS service and enable the Virtual Private Cloud (VPC) whitelist of Alibaba Cloud Logstash.

RDS Data Preparation

Prepare a student table as follows:

Pay special attention to the following fields in the preceding RDS configuration:

1) student: The name of the table from which data is read and synchronized to Elasticsearch.

2) Id: The unique identifier of the record. In this experiment, IDs must be mapped to Elasticsearch IDs, so data uniqueness must be guaranteed. “id” is defined as the primary key and the unique key to ensuring that each “id” appears only once in the current table. This field will be converted to “_id” for updating documents in Elasticsearch and inserting documents to Elasticsearch.

3) update_time: When any record is inserted or changed in the table, the value of this field is set to the editing time. Then, it’s possible to extract any records that have been edited since the last time Logstash requested a record from the RDS instance.

4) create_time: This field is used to track and record the time when a record is first inserted into the table.

Other fields are user-defined. You can customize multiple fields as required.

The following figure shows the created table.

Logstash Synchronization Configuration

The following Logstash pipelines implement the synchronization code described in the previous section:

Let’s take a look at the following key fields:

1) tracking_column: This field specifies unix_ts_in_secs, which marks the last document that Logstash read from the database. It is stored in the file specified by last_run_metadata_path. The value of this field determines the starting value of the document that Logstash requests in the next iteration of its polling loop.

2) unix_ts_in_secs: This field is generated by the preceding SELECT statement, which contains “update_time” as the standard UNIX timestamp (number of seconds since Epoch). This field is referenced by the “tracking_column” field. The UNIX timestamp is used to track progress. If it is used as a simple timestamp, errors may occur because conversion between Universal Metric Time (UMT) and local time is very complicated.

3) sql_last_value: The built-in parameter, including the start time of the current iteration in the Logstash polling loop. This field is referenced by the SELECT statement in the preceding JDBC input configuration. This field is set to the latest value of unix_ts_in_secs. It is used as the start point for the returned documents during a query in the Logstash polling loop. By adding this variable to the query, ensure that inserted or updated records get transferred to Elasticsearch and are not re-sent to Elasticsearch.

4) schedule: The frequency at which Logstash polls the RDS instance for changes. The cron syntax is used for this field. The value “/5” indicates that Logstash contacts the table every five seconds.

5) filter: You only need to copy the “id” value in the RDS record to the metadata field named “_id”. This field will be referenced in the output so that each document written to Elasticsearch contains a correct “_id” value. Using the metadata field, this temporary value does not create a new field. We have also removed the “id”, “@version” and “unix_ts_in_secs” fields from the documents because we do not want to write these fields into Elasticsearch.

6) output: It specifies that each document should be written into Elasticsearch and assigned with an “_id” field (extracted from the metadata fields created in the filter section).

Analysis of SELECT Statement Correctness

This section explains in detail why it is crucial to add update_time < NOW() to the SELECT statement. First, let’s take a look at a few examples that show why the two most intuitive methods do not work. Then, we will deep dive into how update_time < NOW() overcomes the problems caused by these two intuitive methods.

Intuitive Method 1

In this section, we will demonstrate what will happen if “UNIX_TIMESTAMP(update_time) > :sql_last_value” without “update_time < NOW()” is included in the WHERE clause. Then, the SELECT statement is as follows:

At first glance, this method seems to work. However, in certain edge cases, some documents are not available. For example, assume that two documents are inserted into the RDS instance per second and the SELECT statement is run in Logstash every 5 seconds. As shown in the following figure, T0 to T10 each represents a second, and R1 to R22 specify the data in the RDS instance. Assume that the first iteration of the Logstash polling loop occurs at T5. Then, it reads documents R1 to R11, as shown in the bluish-green boxes. The value stored in the sql_last_value is T5 because this is the timestamp of the last read record (R11). Assume that after Logstash reads all the documents from the RDS instance, document R12 with the timestamp T5 is inserted into the RDS instance immediately.

In the next iteration of the preceding SELECT statement, only documents with a timestamp later than T5 are extracted (as specified by WHERE (UNIX_TIMESTAMP(update_time) > :sql_last_value)). This means that the record R12 will be skipped. As shown in the following figure, records in the bluish-green boxes are read by Logstash in the current iteration, and records in gray boxes were previously read by Logstash.

Note that if you use the SELECT statement in this case, the record R12 will never be written to Elasticsearch.

Intuitive Method 2

To solve the preceding problem, you may change the logic in the WHERE clause to “greater than or equals” as follows:

However, this method also has a problem. The documents read from the RDS instance during the latest interval are repeatedly sent to Elasticsearch. Although this is unnecessary, it will not affect the correctness of the results. After the first iteration of the Logstash polling loop, the documents marked as ‘read’ from the RDS instance are shown in the following figure.

After the second iteration is completed, all the documents with a timestamp later than or equal to T5 are extracted, as shown in the following figure.

Note that R11 (in the purple box) will be sent to Elasticsearch again.

The previous two methods are not ideal. In the first method, data is lost. In the second method, redundant data is read from the RDS instance and sent to Elasticsearch.

Solutions to the Problems Caused by the Intuitive Methods

We need to find a better method. The following statement is used to send each document to Elasticsearch only once:

The current Logstash polling loop will be performed at T5. But the following condition must be met:

Therefore, only documents with a timestamp before T5 are read from the RDS instance. We have extracted all the documents of T4 but have not read any documents of T5. Therefore, sql_last_value will be set to T4 for the next iteration of the Logstash polling loop.

The following figure shows what will happen in the next iteration of the Logstash polling loop. Due to the UNIX_TIMESTAMP(update_time) > :sql_last_value statement, where sql_last_value is set to T4, document extraction starts from T5. In addition, only documents meeting the update_time < NOW() condition are extracted. Therefore, only documents with a timestamp of T9 or earlier are extracted. This means that all documents in T9 are extracted and sql_last_value is set to T9 for the next iteration. Therefore, this method eliminates the risk of retrieving only a subset of MySQL documents at any specified time interval.

Test System

Once the JDBC input plan triggers reading records from an RDS instance and writing the records to Elasticsearch, run the following query statement to view the documents in Elasticsearch:

Output similar to the following is displayed:

Then, run the following command to update the record with an id value of 20205003:

JDBC will correctly update the record. Run the following command to view an Elasticsearch document:

A document similar to the following is returned:

Note: Now _version is set to 3, update_time is different from create_time, and the stuname field has been correctly updated to the new value. In this example, the @timestamp field is not very useful, but it is added by Logstash by default.

Run the following command to complete update or insertion in the RDS instance. Also, verify whether the correct information is contained in Elasticsearch.

Statement: This article is an authorized revision of the article “How to Use Logstash and JDBC to Ensure Data Synchronization Between Elasticsearch and Relational Databases” based on the Alibaba Cloud service environment.

Source: (Page in Chinese) https://elasticstack.blog.csdn.net/

The Alibaba Cloud Elastic Stack is completely compatible with open-source Elasticsearch and has nine unique capabilities.

For more information about discounts, visit the Alibaba Cloud Elasticsearch official website.

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store