Archive Log Service Data to MaxCompute for Offline Analysis Using DataWorks

By Yi Xiu

When you ship data to MaxCompute by using DataWorks, you may have run into problems setting partition parameters or DataWorks scheduling parameters. This article works through these problems by simulating a real case.


Official help document:

Create a Data Source:

Step 1. Go to Data Integration and then go to the Data Source tab page.

Step 2. In the upper right corner, click Add Data Source and choose Message Queue > LogHub.

Step 3. Enter the required fields in the Add LogHub Data Source dialog box: Data Source Name, LogHub Endpoint, Project, AccessKey ID, and AccessKey Secret. Then click Test Connectivity.

Create Destination Tables:

Step 1. Click Temporary Query in the left-side navigation pane. Right-click anywhere on the Query page and select Create > ODPS SQL.

Step 2. Write the DDL statement for creating the tables.

Step 3. Click the Run button to create the destination tables: ods_client_operation_log, ods_vedio_server_log, and ods_web_tracking_log.

Step 4. The message “shell run successfully!” indicates that the three DDL statements have run successfully.


Step 5. Use the desc command to view the created tables.

Run the desc command on the other two tables as well to confirm that they exist.
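The DDL from Step 2 is not reproduced in this article. As a rough sketch only, the Python helper below generates one plausible CREATE TABLE statement per destination table. The partition columns ds, hr, and min match the parameters set later in the synchronization node; their STRING type, the single `content` column, and the helper name `build_ddl` are assumptions made for illustration, not the schema actually used.

```python
# Hypothetical sketch: generate DDL text for the three destination tables.
# The column list and the STRING partition types are assumptions.

TABLES = [
    "ods_client_operation_log",
    "ods_vedio_server_log",
    "ods_web_tracking_log",
]

# Placeholder schema; replace with the fields your logs really contain.
HYPOTHETICAL_COLUMNS = {"content": "STRING"}


def build_ddl(table: str, columns: dict) -> str:
    """Return a CREATE TABLE statement partitioned by ds, hr, and min."""
    cols = ",\n".join(f"    {name} {ctype}" for name, ctype in columns.items())
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n)\n"
        "PARTITIONED BY (ds STRING, hr STRING, min STRING);"
    )


if __name__ == "__main__":
    for table in TABLES:
        print(build_ddl(table, HYPOTHETICAL_COLUMNS))
        print()
```

The printed statements can be pasted into the ODPS SQL query from Step 1 after the placeholder column is replaced with the real log fields.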

Create a Data Synchronization Task

After creating the data source and testing the connectivity in DataWorks, you can synchronize data from the data source to MaxCompute through a data synchronization task.


Step 1. Click Create Business Flow, name the business flow ApsaraVideo Live log collection, and then click Confirm.

Step 2. Successively create the following dependencies on the Business Flow Development panel.

Configure the data synchronization nodes as follows: web_tracking_log_syn, client_operation_log_syn, and vedio_server_log_syn.

Step 3. Double-click web_tracking_log_syn to open the node configuration page. The configuration items include: Data Source (Source and Destination), Mappings (Source Table and Destination Table), and Channel.


Specify the parameters based on the data collection window as follows:

Data is consumed once every five minutes, from 00:00 to 23:59. Each run reads the window from startTime=[yyyymmddhh24miss-10/24/60], which is 10 minutes before the system time, to endTime=[yyyymmddhh24miss-5/24/60], which is 5 minutes before the system time. Note that these times are different from the consumption checkpoint shown in the preceding figure. Then set the partition parameters: ds=[yyyymmdd-5/24/60], hr=[hh24-5/24/60], and min=[mi-5/24/60].
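To make the offsets concrete, the sketch below evaluates the five expressions for a single run in plain Python. It assumes that the offset after the minus sign is expressed in days, so 10/24/60 is ten minutes and 5/24/60 is five minutes, and that the expressions are evaluated against the time of the run. The function name `sync_window_params` is hypothetical and is not part of DataWorks.

```python
from datetime import datetime, timedelta


def sync_window_params(run_time: datetime) -> dict:
    """Evaluate startTime, endTime, ds, hr, and min for one run.

    Assumption: the offsets are fractions of a day, so 10/24/60 days
    is 10 minutes and 5/24/60 days is 5 minutes.
    """
    start = run_time - timedelta(minutes=10)  # yyyymmddhh24miss-10/24/60
    end = run_time - timedelta(minutes=5)     # yyyymmddhh24miss-5/24/60
    return {
        "startTime": start.strftime("%Y%m%d%H%M%S"),
        "endTime": end.strftime("%Y%m%d%H%M%S"),
        # The partition values use the same 5-minute offset as endTime.
        "ds": end.strftime("%Y%m%d"),
        "hr": end.strftime("%H"),
        "min": end.strftime("%M"),
    }


if __name__ == "__main__":
    # A run at midnight lands in the last partition of the previous day.
    for key, value in sync_window_params(datetime(2019, 1, 1, 0, 0, 0)).items():
        print(f"{key}={value}")
```

For a run at 2019-01-01 00:00:00, this yields startTime=20181231235000, endTime=20181231235500, ds=20181231, hr=23, and min=55. The midnight case shows why ds, hr, and min carry the same five-minute offset as endTime: without it, the last window of each day would be written to the next day's partition.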

Step 4. Click Advanced run to perform testing.


You can perform testing by manually entering custom parameters.

Step 5. Use a SQL script to verify whether the data has already been written into the destination table, as shown in the following figure.
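The verification script itself is not reproduced here. One way to check a single partition, sketched under the assumption that the destination table is partitioned by ds, hr, and min, is to count the rows in the partition that the test run should have filled. The helper name `build_check_query` and the sample partition values are hypothetical.

```python
def build_check_query(table: str, ds: str, hr: str, minute: str) -> str:
    """Return a row-count query for one ds/hr/min partition."""
    return (
        f"SELECT COUNT(*) AS row_count FROM {table} "
        f"WHERE ds = '{ds}' AND hr = '{hr}' AND min = '{minute}';"
    )


if __name__ == "__main__":
    # Sample values only; use the parameters entered for the test run.
    print(build_check_query("ods_web_tracking_log", "20181231", "23", "55"))
```

A non-zero row_count for the partition suggests that the synchronization node wrote data for that window.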

After synchronizing logs of Log Service to MaxCompute, you can proceed with the data processing.

For example, you can record the statistics of top channels, region distribution, and buffering lag.


The detailed SQL logic will not be elaborated here. You can implement statistical analysis based on your actual business needs. The dependency relationship is configured as shown in the preceding figure.

