Archive Log Service Data to MaxCompute for Offline Analysis Using DataWorks
By Yi Xiu
When you ship data to MaxCompute by using DataWorks, you may run into problems setting the partition parameters or the DataWorks scheduling parameters. This article walks through a simulated real-world case to show how to solve these problems.
Official help document: https://www.alibabacloud.com/help/doc-detail/68322.html
Create a Data Source
Step 1. Go to Data Integration and then go to the Data Source tab page.
Step 2. In the upper right corner, click Add Data Source and choose Message Queue > LogHub.
Step 3. Fill in the required fields in the Add LogHub Data Source dialog box: Data Source Name, LogHub Endpoint, Project, AccessKey ID, and AccessKey Secret. Then click Test Connectivity.
Create Destination Tables
Step 1. Click Temporary Query in the left-side navigation pane. Right-click anywhere on the Query page, and select Create > ODPS SQL.
Step 2. Write the DDL statement for creating the tables.
Step 3. Click the Run button to create the destination tables: ods_client_operation_log, ods_vedio_server_log, and ods_web_tracking_log.
Step 4. When you see the message “shell run successfully!”, these three DDL statements have been run successfully.
Step 5. Use the desc command to view the created tables.
You can use the desc command to view the other two tables, and ensure they exist.
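The DDL statements themselves are not reproduced in this article. As a rough sketch only, the following Python helper builds a CREATE TABLE statement for a table partitioned by ds, hr, and min, which are the partition parameters set later in the synchronization task. The helper name and the single content column are hypothetical placeholders, not the real schema; replace them with the fields of your own Logstores.

```python
from typing import Iterable, Tuple


def build_create_table_ddl(
    table: str,
    columns: Iterable[Tuple[str, str]],
    partitions: Iterable[str] = ("ds", "hr", "min"),
) -> str:
    """Build a CREATE TABLE statement for a partitioned table.

    `columns` is a list of (name, type) pairs. The partition columns are
    assumed to be strings named ds, hr, and min.
    """
    col_sql = ",\n".join(f"    {name} {dtype}" for name, dtype in columns)
    part_sql = ", ".join(f"{name} STRING" for name in partitions)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"{col_sql}\n"
        f")\n"
        f"PARTITIONED BY ({part_sql});"
    )


if __name__ == "__main__":
    # Hypothetical schema: one STRING column named "content" per table.
    for name in (
        "ods_client_operation_log",
        "ods_vedio_server_log",
        "ods_web_tracking_log",
    ):
        print(build_create_table_ddl(name, [("content", "STRING")]))
        print()
```

You can paste the printed statements into the ODPS SQL node and run them together, then check each table with the desc command.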
Create a Data Synchronization Task
After creating the data source and testing the connectivity in DataWorks, you can synchronize data from the data source to MaxCompute through a data synchronization task.
Step 1. Click Create Business Flow, name the business flow ApsaraVideo Live log collection, and then click Confirm.
Step 2. On the Business Flow Development panel, create the following nodes and their dependencies in sequence.
Name the data synchronization nodes as follows: web_tracking_log_syn, client_operation_log_syn, and vedio_server_log_syn.
Step 3. Double-click web_tracking_log_syn to open the node configuration page. The configuration items include: Data Source (Source and Destination), Mappings (Source Table and Destination Table), and Channel.
Specify the parameters based on the data collection window as follows:
Schedule the node to run once every five minutes, from 00:00 to 23:59. Each run reads the window from startTime=[yyyymmddhh24miss-10/24/60] (10 minutes before the system time) to endTime=[yyyymmddhh24miss-5/24/60] (5 minutes before the system time). Note that this time is different from the consumption checkpoint shown in the preceding figure. Then set the partition parameters to ds=[yyyymmdd-5/24/60], hr=[hh24-5/24/60], and min=[mi-5/24/60], so that each partition is named after the end of the window.
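To make the arithmetic concrete: the offsets are expressed in days, so 10/24/60 of a day is 10 minutes and 5/24/60 of a day is 5 minutes. The following Python sketch reproduces the values these expressions should yield for a given run time. It is only an illustration for checking your expectations; DataWorks evaluates the expressions itself, and the function name sync_window_params is made up for this example.

```python
from datetime import datetime, timedelta


def sync_window_params(run_time: datetime) -> dict:
    """Return the five parameter values for one five-minute run.

    startTime is 10 minutes before the run time, endTime is 5 minutes
    before it, and ds/hr/min are all derived from endTime.
    """
    start = run_time - timedelta(minutes=10)
    end = run_time - timedelta(minutes=5)
    return {
        "startTime": start.strftime("%Y%m%d%H%M%S"),  # yyyymmddhh24miss
        "endTime": end.strftime("%Y%m%d%H%M%S"),
        "ds": end.strftime("%Y%m%d"),  # yyyymmdd
        "hr": end.strftime("%H"),      # hh24
        "min": end.strftime("%M"),     # mi
    }


if __name__ == "__main__":
    # A run just after midnight: the window falls on the previous day.
    for key, value in sync_window_params(datetime(2024, 1, 1, 0, 3, 0)).items():
        print(f"{key}={value}")
```

For a run at 00:03:00 on 2024-01-01, the window is 23:53:00 to 23:58:00 on 2023-12-31, so the data lands in the partition ds=20231231, hr=23, min=58. Subtracting the same five minutes from ds, hr, and min is what keeps the partition consistent with the window across hour and day boundaries.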
Step 4. Click Advanced run to perform testing.
You can perform testing by manually entering custom parameters.
Step 5. Use a SQL script to verify whether the data has already been written into the destination table, as shown in the following figure.
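The verification script is shown only as a figure. One possible form of such a check, assuming the destination table is partitioned by ds, hr, and min, is a row count on a single partition. The Python helper below (a hypothetical name, written for this example) builds that query text so that you can paste it into an ODPS SQL node. If your project treats min as a reserved word, you may need to quote the column name.

```python
def build_partition_count_query(table: str, ds: str, hr: str, minute: str) -> str:
    """Build a row-count query for one ds/hr/min partition.

    The partition values must be digit strings such as "20231231",
    "23", and "58"; anything else is rejected.
    """
    for value in (ds, hr, minute):
        if not value.isdigit():
            raise ValueError(f"partition value must be digits, got {value!r}")
    return (
        f"SELECT COUNT(*) AS row_count FROM {table} "
        f"WHERE ds = '{ds}' AND hr = '{hr}' AND min = '{minute}';"
    )


if __name__ == "__main__":
    print(build_partition_count_query("ods_web_tracking_log", "20231231", "23", "58"))
```

A non-zero row count for the partition that matches the test run suggests that the synchronization node wrote data where expected.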
After synchronizing logs of Log Service to MaxCompute, you can proceed with the data processing.
For example, you can record the statistics of top channels, region distribution, and buffering lag.
The detailed SQL logic will not be elaborated here. You can implement statistical analysis based on your actual business needs. The dependency relationship is configured as shown in the preceding figure.