Reduce Server Usage by 80% by Migrating from MongoDB to Elasticsearch

Released by ELK Geek

MongoDB and Elasticsearch are two popular databases and have been the subject of much debate among supporters of each technology. However, this article represents only the author's individual experience, not the opinions of any group. This article covers the following two topics:

  • Why you should migrate from MongoDB to Elasticsearch
  • How to migrate from MongoDB to Elasticsearch

Background

1. Project Background

The operation logging system records the following two types of data:

1) Primary change data that describes who performed an operation, what operation was performed, which system module was affected, when the operation occurred, which data IDs were involved, and which trace IDs were assigned. For example:

{
  "dataId": 1,
  "traceId": "abc",
  "moduleCode": "crm_01",
  "operateTime": "2019-11-11 12:12:12",
  "operationId": 100,
  "operationName": "Zhang San",
  "departmentId": 1000,
  "departmentName": "Account Department",
  "operationContent": "Visit clients"
}

2) Secondary change data that records the actual values before and after a change. Changes to multiple fields in one row of data produce multiple entries, so a large number of such entries is recorded. For example:

[
  {
    "dataId": 1,
    "traceId": "abc",
    "moduleCode": "crm_01",
    "operateTime": "2019-11-11 12:12:12",
    "operationId": 100,
    "operationName": "Zhang San",
    "departmentId": 1000,
    "departmentName": "Account Department",
    "operationContent": "Visit clients",
    "beforeValue": "20",
    "afterValue": "30",
    "columnName": "customerType"
  },
  {
    "dataId": 1,
    "traceId": "abc",
    "moduleCode": "crm_01",
    "operateTime": "2019-11-11 12:12:12",
    "operationId": 100,
    "operationName": "Zhang San",
    "departmentId": 1000,
    "departmentName": "Account Department",
    "operationContent": "Visit clients",
    "beforeValue": "2019-11-02",
    "afterValue": "2019-11-10",
    "columnName": "lastVisitDate"
  }
]

2. Project Architecture

  1. When you add or edit data in the business system, an operation log record is generated and sent to a Kafka cluster, using the dataId field as the key.
  2. The data you add or edit is stored in a MySQL database.
  3. A Canal cluster subscribes to the MySQL cluster and configures the databases and tables monitored according to the modules of the business system.
  4. The Canal cluster sends the modified business data to the Kafka cluster, using the dataId field as the key.
  5. The operation log system obtains primary and secondary record data from the Kafka cluster.
  6. The operation log system writes the data to MongoDB and must also support queries back from the business systems.

Figure: Workflow of the operation logging system

MongoDB Architecture

1) Servers are configured with 8 CPU cores, 32 GB of memory, and 500 GB solid-state drives (SSDs).

2) Three router servers are deployed.

3) Three configuration servers are deployed.

4) Nine shard servers are deployed.

5) Three shards are designed for primary operation records.

6) Three shards are designed for secondary operation records.

Issues

1. Search and Query

1) Queries for operation log records in the business system involve many filter criteria that can be combined arbitrarily. Neither MongoDB nor any relational database supports this well: you would have to create a huge number of B+Tree composite indexes, which is impractical.

2) In addition, primary and secondary records contain a lot of character-type data, which must support both exact queries and full-text search. MongoDB's capabilities here are limited and its performance is poor, leading to frequent timeouts in business system queries. By contrast, Elasticsearch is a very suitable solution.
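
To illustrate, here is a minimal sketch of such a combined query in Elasticsearch, assuming a hypothetical index named oplog_detail and the field names from the sample documents above. Exact filters go into a bool filter clause while full-text search uses match:

# Arbitrary combination of exact filters plus full-text search (index name assumed)
GET /oplog_detail/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "moduleCode": "crm_01" } },
        { "term": { "departmentId": 1000 } },
        { "range": { "operateTime": { "gte": "2019-11-01 00:00:00" } } }
      ],
      "must": [
        { "match": { "operationContent": "clients" } }
      ]
    }
  }
}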

2. Technology Stack Maturity

1) Operation logs accumulate rapidly, with over 10 million new entries every day. As a result, you have to scale out the servers at short intervals, and this process is much more complicated in MongoDB than in Elasticsearch.

2) Each MongoDB collection contains more than 1 billion records. As a result, the performance of even a simple query in MongoDB is inferior to an inverted-index query in Elasticsearch.

3) The company has different levels of experience with the Elasticsearch and MongoDB technology stacks. Elasticsearch is widely used in many projects, including core ones, so the team is well accustomed to its technology and O&M. By contrast, MongoDB has found no suitable use apart from core business scenarios, yet no one wants to risk using it in core projects, leaving MongoDB in an awkward position.

3. Same Document Formats

Both MongoDB and Elasticsearch store schema-flexible JSON documents, so the existing data model can be carried over with little structural change.

Migration Solution

1) Migrate the application system at the upper layer. This involves shifting from MongoDB-oriented syntax rules to Elasticsearch-oriented ones.

2) Migrate data at the lower layer from MongoDB to Elasticsearch.

1. Evaluate Elasticsearch Capacity

2. Set Elasticsearch Index Rules
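
One common way to set index rules for log data at this volume, offered here only as an illustrative assumption rather than the author's exact scheme, is to create time-based indices and group them behind a single query alias:

# Hypothetical monthly indices grouped behind one alias for querying
PUT /oplog_detail_201911
PUT /oplog_detail_201912
POST /_aliases
{
  "actions": [
    { "add": { "index": "oplog_detail_201911", "alias": "oplog_detail" } },
    { "add": { "index": "oplog_detail_201912", "alias": "oplog_detail" } }
  ]
}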

3. Design Core Implementation Logic

  • When primary data reaches the operation log system before secondary data, the primary record and the binlog field data are merged at the time the secondary data is written.
  • When secondary data reaches the operation log system before primary data, the relevant index fields in the secondary indexes are updated once the primary data arrives.

In Elasticsearch, index data becomes searchable according to a near-real-time refresh mechanism, so data cannot be queried through the search APIs immediately after it is submitted. How, then, can we propagate primary record data into secondary records? In addition, the same data ID or trace ID may appear in multiple primary records due to a lack of standardization across business departments.

Primary data is correlated to secondary data by the dataId and traceId fields. Therefore, an update based on the _update_by_query API will be invalid or incorrect if primary data and secondary data arrive at the operation log system at the same time. In addition, primary and secondary data may be correlated on a many-to-many basis, so the dataId and traceId fields are not unique identifiers of a record.

In fact, Elasticsearch is also a NoSQL database that can serve as a key-value cache. Therefore, you can create an Elasticsearch index as an intermediate cache that holds whichever arrives first, primary data or secondary data. The _id of each cache document is composed of the dataId and traceId fields, which lets you find the IDs of the related primary or secondary records through this intermediate document. The cache index's data model is structured as follows, where the detailId field holds the _id values of the related secondary records.

{
  "dataId": 1,
  "traceId": "abc",
  "moduleCode": "crm_01",
  "operationId": 100,
  "operationName": "Zhang San",
  "departmentId": 1000,
  "departmentName": "Account Department",
  "operationContent": "Visit clients",
  "detailId": [1, 2, 3, 4, 5, 6]
}
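
Here is a minimal sketch of maintaining this cache, assuming a hypothetical index named oplog_relation_cache and a composite _id such as 1_abc built from dataId and traceId. A scripted upsert appends a secondary record's ID whether or not the cache document already exists, and GET-by-_id access is real-time, which sidesteps the refresh delay discussed above:

# Upsert the intermediate cache document keyed by dataId + traceId (names assumed)
POST /oplog_relation_cache/_update/1_abc
{
  "script": {
    "source": "ctx._source.detailId.add(params.id)",
    "params": { "id": 7 }
  },
  "upsert": {
    "dataId": 1,
    "traceId": "abc",
    "detailId": [7]
  }
}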

As mentioned above, primary and secondary records for the same dataId land on the same Kafka partition because both are keyed by that field. This allows the consumer to pull data in batches and process it with the following core Elasticsearch APIs:

# Query records in secondary indexes in bulk
_mget
# Insert records in bulk
_bulk
# Delete intermediate temporary cache documents in bulk
_delete_by_query
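
For illustration, the shapes of these three calls might look as follows; the index names oplog_relation_cache and oplog_primary are assumptions:

# Fetch cache documents for a batch of composite IDs (real-time by _id)
GET /oplog_relation_cache/_mget
{ "ids": ["1_abc", "2_def"] }

# Write a batch of records in one round trip
POST /_bulk
{ "index": { "_index": "oplog_primary", "_id": "1_abc" } }
{ "dataId": 1, "traceId": "abc", "operationName": "Zhang San" }

# Clear consumed entries from the intermediate cache
POST /oplog_relation_cache/_delete_by_query
{ "query": { "ids": { "values": ["1_abc", "2_def"] } } }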

Migration Procedure

1. Migrate Data

  • Historical data: Operation log records are historical data that rarely need further modification once generated, making them similar to offline data.
  • One-off migration: When the project is completed, the original MongoDB cluster is terminated entirely, and no second migration is required.
  • Data volume: The original MongoDB operation logs amount to billions of entries.

Therefore, the migration must run at an appropriate speed: an excessively fast migration causes performance problems for the MongoDB cluster, while an excessively slow one prolongs the project and increases O&M cost and complexity. (If speed were not a concern, Hadoop could serve as an intermediate migration platform.) DataX was chosen as the synchronization tool, for the reasons below; a trimmed job sketch follows the list.

  • Scenario-based modifications of the DataX source code: DataX allows you to modify the source code to suit different scenarios, such as date type conversion and generation or mapping of primary index key _id fields. It also supports repeated synchronization.
  • Multi-instance and multi-thread operations in parallel: Synchronization of primary data and synchronization of secondary data are both deployed on multiple instances, and each instance is configured with multiple channels.
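
The following is a trimmed sketch of what one such DataX job might look like. The parameters below follow the public mongodbreader and elasticsearchwriter plugins, but the hosts, names, columns, and channel count are assumptions and will differ after the source-code modifications described above:

{
  "job": {
    "setting": { "speed": { "channel": 8 } },
    "content": [{
      "reader": {
        "name": "mongodbreader",
        "parameter": {
          "address": ["mongo-host:27017"],
          "dbName": "oplog",
          "collectionName": "operation_detail",
          "column": [
            { "name": "dataId", "type": "long" },
            { "name": "traceId", "type": "string" },
            { "name": "operateTime", "type": "date" }
          ]
        }
      },
      "writer": {
        "name": "elasticsearchwriter",
        "parameter": {
          "endpoint": "http://es-host:9200",
          "index": "oplog_detail",
          "column": [
            { "name": "dataId", "type": "long" },
            { "name": "traceId", "type": "keyword" },
            { "name": "operateTime", "type": "date" }
          ]
        }
      }
    }]
  }
}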

2. Configure Migration Indexes

"index.number_of_replicas": 0,
"index.refresh_interval": "30s",
"index.translog.flush_threshold_size": "1024M"
"index.translog.durability": "async",
"index.translog.sync_interval": "5s"

3. Migrate Applications

# Write flag for MongoDB in applications
writeflag.mongodb: true
# Write flag for Elasticsearch in applications
writeflag.elasticsearch: true

Project-based modifications:

  • When the project goes online for the first time, set both write flags to true to enable double-writing to MongoDB and Elasticsearch.
  • Two different read interfaces are provided for flexible frontend switching.
  • When the data migration is complete and verification shows no differences, change the flag values so that writes go to Elasticsearch only; a sketch of the difference check follows this list.
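
As a hedged sketch of that difference check, document counts can be compared per time window between the two stores. The Elasticsearch side might look like this (index name assumed), with the matching MongoDB count taken over the same window:

# Compare this count with the MongoDB count for the same time window
GET /oplog_detail/_count
{
  "query": {
    "range": {
      "operateTime": { "gte": "2019-11-01 00:00:00", "lt": "2019-12-01 00:00:00" }
    }
  }
}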

Summary

1. Benefits of Migration

2. Lessons Learned

About the Author

Declaration: This article is reproduced with authorization from Li Meng, the original author. The author reserves the right to hold users legally liable in the case of unauthorized use.

Original Source:
