Schedule Data Lake Analytics Tasks by Using Airflow

This article describes how to schedule Alibaba Cloud Data Lake Analytics (DLA) tasks by using Airflow. Customers run scheduled tasks to query data from DLA (which is a data lake solution) and import the data to the business system every day. DLA is compatible with MySQL protocols. Therefore, all scheduling frameworks that support MySQL protocols also support DLA. This document describes how to schedule DLA tasks by using Apache Airflow.

The procedure is as follows:

  1. Purchase an Alibaba Cloud Elastic Compute Service (ECS) to run Airflow.
  2. Install Airflow.
  3. Add the DB connection of DLA.
  4. Develop task scripts.

Purchase and Configure an ECS

Log in to the Alibaba Cloud and navigate to the ECS console. Once there, purchase an ECS instance through the console.

When setting up your ECS, please note the following:

  1. Ensure that the region of the purchased ECS is the same as the region where you enable DLA.
  2. Assign the ECS permissions to access some Airflow web consoles over the Internet.
  3. Enabling inbound port 80 in the security group is necessary for the ECS to access the Airflow web pages through port 80, as shown in the following figure.
Image for post
Image for post

In addition, record the public IP address of the ECS.

Image for post
Image for post

Install Airflow

Airflow is compiled in the Python language and installed by using Python’s Package Manager pip. Because you use MySQL (instead of the default SQLite) as the Airflow metadatabase, you also need to install MySQL-related packages.

Modify the following configuration of the MySQL installed by default:

After the modification is complete, restart MySQL.

After Airflow is installed, the ~/airflow directory is generated in your local user directory. The directory contains the following content:

In the preceding content, airflow.cfg is the configuration file of the Airflow cluster, and you can modify configurations in this file. The dags directory saves your tasks.

Initialize the Airflow Metadatabase

After installing the MySQL database, you need to create a database to save Airflow metadata.

In this case, the metadatabase is initialized.

Install Dask

Airflow is a scheduling tool whose tasks are run by an executor. The default executor is SequentialExecutor, which is not suitable for the production environment. Distributed executors include Celery and Dask. We have tried Celery and found that it has multiple bugs. Therefore, we recommend using Dask.

Install Dask.

Run Dask scheduler.

Run Dask worker.

Configure airflow.cfg

Because you use MySQL as the metadatabase and use Dask to run tasks instead of the default configurations, you need to modify the ~/airflow/airflow.cfg file.

Start Airflow

After all preparations have been completed, you can start Airflow. You need to start the following three modules of Airflow:

webserver: used to bear the management and control pages of Airflow.

scheduler: task scheduler. It monitors changes of task files defined in ~/airflow/dags. In this way, you can view your newly developed tasks on the management console.

worker: interacts with Dask and runs tasks.

Then, an Airflow cluster is ready for you to run tasks. Some sample tasks are provided in default configurations. You can view the Airflow control page by entering http://Public IP address of your ECS in the address bar of a browser.

Image for post
Image for post

Develop tasks

The objective is to schedule DLA tasks by using Airflow. First, you need to add a connection string. In Airflow, the Connections page saves connection string information. After you enter http://Public IP address of your ECS/admin/connection/ in the address bar of a browser, the following page appears.

Image for post
Image for post

Add DLA connection information.

Image for post
Image for post

Note the following two points:

  1. Select MySQL for Conn Type. (DLA is compatible with MySQL protocols.)
  2. Conn Id is a key parameter used to access data sources in your tasks.

Develop Task Code

In this task, you define a DAG, which indicates a task process. In the process, you run multiple dependent tasks. The first DAG parameter specifies the DAG name and you set it to dlademo in this task. The third parameter specifies the scheduling interval, and you set it to timedelta(1), indicating once a day.

The first task is to run a bash command (echo hello-airflow). The second task is to query and print a table in the DLA database by using SQL. Finally, t2.set_upstream(t1) is used to configure the dependencies between the two tasks.

After entering http://Public IP address of your ECS/admin/airflow/tree?dag_id=dlademoin the address box of the browser, you can view details about the task.

Image for post
Image for post

In the preceding figure, you can view the two tasks defined and their dependencies. Airflow has abundant functions for you to experience.

Summary

Airflow is a top-level Apache project in terms of its maturity and abundant functions. It is easy to master and allows you to easily establish your clusters. In addition, Airflow has its own connection mechanism. In this way, you do not need to disclose the user name and password of the database in the task script. You can try to schedule your DLA tasks by using Airflow.

References

  1. Airflow installation manual: https://airflow.apache.org/installation.html
  2. Scaling out with Dask: https://airflow.readthedocs.io/en/stable/howto/executor/use-dask.html

To learn more about Alibaba Cloud Data Lake Analytics, visit www.alibabacloud.com/products/data-lake-analytics

Reference:https://www.alibabacloud.com/blog/schedule-data-lake-analytics-tasks-by-using-airflow_594183

Written by

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store