Best Practices for Migrating Data from Kafka to MaxCompute

By Fu Shuai

Prerequisites

Build a Kafka Cluster

Before data migration, you must ensure that your Kafka cluster works properly. In this article, we use Alibaba Cloud E-MapReduce (EMR) to build a Kafka cluster automatically. For details, see Kafka Quick Start.

The EMR Kafka version information used in this article is as follows:

  • EMR version: EMR-3.12.1
  • Cluster type: Kafka
  • Software information: Ganglia 3.7.2, ZooKeeper 3.4.12, Kafka 2.11-1.0.1, Kafka-Manager 1.3.3.16

The network type of this Kafka cluster is VPC in China East 1 (Hangzhou). The ECS compute resource of the master instance group is configured with a public IP and an internal network IP. The specific configuration is shown in the following figure.

Create a MaxCompute Project

Activate MaxCompute and create a project. In this article, we’ve created a project named bigdata_DOC in China East 1 (Hangzhou), and enabled the related DataWorks services, as shown in the following figure. For more information, see Activate MaxCompute.

Background

Kafka is distributed publish-subscribe messaging middleware. It is widely used for its high performance and high throughput, and can process millions of messages per second. Kafka is well suited to stream data processing, and is mainly used in scenarios such as user behavior tracking and log collection.

A typical Kafka cluster contains several Producers, Brokers, Consumers, and a ZooKeeper cluster. The Kafka cluster manages its own configuration and performs service collaboration through ZooKeeper.

A Topic is the most commonly used collection of messages in a Kafka cluster, and is a logical concept for message storage. A Topic is not stored on a physical disk as a whole. Instead, its messages are stored by partition on the disks of the nodes in the cluster. Multiple Producers can send messages to a Topic, and multiple Consumers can pull (consume) messages from it.

When a message is added to a partition, it is assigned an offset (numbered from 0), which uniquely identifies the message within that partition.
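To make the relationship between Topics, partitions, and offsets concrete, here is a minimal in-memory sketch. It is a toy model only, not how Kafka stores data; the `Topic` class and its `append` method are hypothetical names introduced for illustration.

```python
class Topic:
    """Toy model of a Topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        # One independent message log per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        """Append a message to a partition and return its offset."""
        log = self.partitions[partition]
        log.append(message)
        # Offsets start at 0 and are unique only within one partition.
        return len(log) - 1


topic = Topic("testkafka", num_partitions=10)
first = topic.append(0, "a")
second = topic.append(0, "b")
other = topic.append(3, "c")
print(first, second, other)  # 0 1 0
```

Note that the message in partition 3 also gets offset 0: an offset identifies a message only in combination with its partition.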

Procedure

  • Prepare the Test Table and Data
  • Create Test Data for the Kafka Cluster

To ensure that you can log on to the Header host of the EMR cluster, and MaxCompute and DataWorks can communicate with the Header host smoothly, first configure the security group for the Header host of the EMR cluster to enable TCP ports 22 and 9092.
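After changing the security group, you may want to confirm that both ports are reachable from the machine you are working on. The following is a minimal sketch that uses only the Python standard library; the function names `port_open` and `check_ports` are hypothetical, and the host you pass in is whatever address your Header host exposes.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_ports(host, ports, timeout=3.0):
    """Return a {port: reachable} mapping for the given host."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, `check_ports("emr-header-1", (22, 9092))` returns a dictionary showing whether SSH (22) and the Kafka broker port (9092) accepted a connection. A successful TCP connection only shows that the port is open, not that the service behind it is healthy.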

  • Log On to the Header Host of the EMR Cluster

In the EMR Hadoop console, go to the Cluster Management > Host List page to confirm the address of the EMR cluster Header host, and then connect to it remotely and log on through SSH.

  • Create a Test Topic

Use the kafka-topics.sh --zookeeper emr-header-1:2181/kafka-1.0.1 --partitions 10 --replication-factor 3 --topic testkafka --create command to create the testkafka Topic used for the test. You can view the created Topic by using the kafka-topics.sh --list --zookeeper emr-header-1:2181/kafka-1.0.1 command.

  • Write Test Data

You can use the kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka command to simulate the Producer writing data to the testkafka Topic. Kafka is used to process streaming data, so you can write data to it continuously. To ensure the test results, we recommend that you write more than 10 data records.
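If you would rather not type the records by hand, a small script can generate them. This is only a sketch: the record format and the `make_records` helper are hypothetical, and it relies on the console producer reading one message per line from standard input.

```python
import sys


def make_records(count=12):
    """Return `count` simple, distinguishable text records."""
    return [f"test-message-{i:03d}" for i in range(1, count + 1)]


if __name__ == "__main__":
    # Each printed line becomes one message when piped to the console producer.
    sys.stdout.write("\n".join(make_records()) + "\n")
```

Saved under a name of your choice, for example make_records.py, the output can be piped into the producer command shown above: python make_records.py | kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka.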

  • Verify Data

To verify that the data was successfully written to Kafka, open a second SSH window and use the kafka-console-consumer.sh --bootstrap-server emr-header-1:9092 --topic testkafka --from-beginning command to simulate the Consumer. If the write succeeded, the console prints the data you wrote, as shown in the following figure.

Create a MaxCompute Table

To ensure that MaxCompute can successfully receive Kafka data, you must first create a table on MaxCompute. In this example, a non-partitioned table is used to facilitate the test.

Log on to DataWorks to create a table. For more information, see Table Management.

You can click DDL mode to create a table. The table creation statement for this example is as follows:

Each of these columns corresponds to one of the default columns of Kafka Reader for DataWorks data integration. You can name the table columns as you like.

  • key indicates the key of the message.
  • value indicates the complete content of the message.
  • partition indicates the partition where the current message is located.
  • headers indicates the headers information of the current message.
  • offset indicates the offset of the current message.
  • timestamp indicates the timestamp of the current message.
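As a rough illustration of a table with one column per default field, the sketch below prints a DDL statement you could paste into DDL mode. Everything beyond the six field names is an assumption: the STRING type for every column, the `_col` suffix (added only to steer clear of words that may be SQL keywords), and the `build_ddl` helper itself. Adjust names and types to your own data.

```python
# Default Kafka Reader columns, as listed above.
KAFKA_COLUMNS = ["key", "value", "partition", "headers", "offset", "timestamp"]


def build_ddl(table, columns, suffix="_col"):
    """Build a CREATE TABLE statement with one STRING column per field.

    The suffix and the all-STRING types are assumptions, not requirements.
    """
    cols = ",\n".join(f"  {name}{suffix} STRING" for name in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n);"


print(build_ddl("testkafka", KAFKA_COLUMNS))
```

The table is non-partitioned, matching the choice made for this test.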

Data Sync

Create a Custom Resource Group

Currently, the default DataWorks resource group cannot fully support the Kafka plug-in. You need to use a custom resource group to synchronize data. For more information about custom resource groups, see Add Task Resources.

In this article, to save resources, we use the Header host of the EMR cluster as the custom resource group. After adding it, wait until the server status changes to available.

Create and Run a Synchronization Task

In your business flow, right-click Data Integration, and choose Create Data Integration Node > Data Synchronization.

After creating a data synchronization node, you need to choose Kafka as the data source and ODPS as the data destination, and use the default data source odps_first. You also need to choose the newly created testkafka as the destination table. After completing the preceding configuration, click the button in the box below to switch to script mode.

The script configuration is as follows.

You can use the kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list command on the Header host to find the value for the group.id parameter, which is the name of the Consumer group.

Note: This will not show information about old ZooKeeper-based consumers.

  • _emr-client-metrics-handler-group
  • console-consumer-69493
  • console-consumer-83505
  • console-consumer-21030
  • console-consumer-45322
  • console-consumer-14773

Taking console-consumer-83505 as an example, you can use the kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --describe --group console-consumer-83505 command on the Header host to confirm the beginOffset and endOffset parameters.

Note: This will not show information about old ZooKeeper-based consumers.

Consumer group “console-consumer-83505” has no active members.
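To show how the values gathered above fit together, the following sketch collects them into a dictionary and prints it as JSON. Treat it as an illustration under assumptions rather than the exact script: only group.id, beginOffset, and endOffset are named in this article, while the `server`, `topic`, and `kafkaConfig` key names, the `kafka_reader_parameters` helper, and the sample offsets 0 and 10 are assumptions. Check the key names against the Kafka Reader documentation before using them in script mode.

```python
import json


def kafka_reader_parameters(server, topic, group_id, begin_offset, end_offset):
    """Collect the reader settings discussed above into one dictionary.

    Key names other than group.id, beginOffset and endOffset are assumptions.
    """
    if begin_offset < 0 or end_offset < begin_offset:
        raise ValueError("expected 0 <= beginOffset <= endOffset")
    return {
        "server": server,                       # assumed key name
        "topic": topic,                         # assumed key name
        "kafkaConfig": {"group.id": group_id},  # wrapper key is assumed
        "beginOffset": begin_offset,
        "endOffset": end_offset,
    }


params = kafka_reader_parameters(
    server="emr-header-1:9092",
    topic="testkafka",
    group_id="console-consumer-83505",
    begin_offset=0,   # hypothetical value
    end_offset=10,    # hypothetical value
)
print(json.dumps(params, indent=2))
```

The range check is there because a reversed offset range is an easy mistake to make when copying values from the describe output.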

After the script configuration is completed, first switch the task resource group to the resource group you just created, and then click Run.

After it completes, you can view the results in the operational log. A log showing successful operation is as follows:

Result Verification

You can run SQL statements by creating a new data development task to see if data synchronized from Kafka already exists in the current table. In this example, use the select * from testkafka; statement, and click Run.

In this example, multiple data records are input in the testkafka Topic to ensure accuracy of the result. You can check whether the data is consistent with what you entered.
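With more than a handful of records, comparing by eye gets error-prone. A minimal sketch of an automated comparison follows, assuming you have both the records you wrote and the values read back from the table as Python lists; the `compare_records` helper and the sample data are hypothetical. It compares the records as multisets because Kafka preserves order only within a partition, so rows may come back in a different order.

```python
from collections import Counter


def compare_records(sent, synced):
    """Return (missing, unexpected) record counts between the two lists."""
    sent_counts, synced_counts = Counter(sent), Counter(synced)
    missing = sent_counts - synced_counts      # written to Kafka, absent from the table
    unexpected = synced_counts - sent_counts   # in the table, never written
    return dict(missing), dict(unexpected)


sent = ["a", "b", "b", "c"]
synced = ["b", "a", "b"]  # order may differ across partitions
print(compare_records(sent, synced))  # ({'c': 1}, {})
```

Two empty dictionaries mean the table holds exactly the records you entered.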
