By Yang Xu, nicknamed Pinshu at Alibaba.

The progress of Flink in the machine learning field has long been the focus of many developers. This year, a new milestone was set in the Flink community when we open-sourced the Alink machine learning algorithm platform. This also marked Flink’s official entry into the field of artificial intelligence.

Yang Xu, a senior algorithm expert at Alibaba, will give you a detailed introduction to the main functions and characteristics of the Alink project. We hope to work with colleagues in the industry to promote the further development of the Flink community.

Download Alink from GitHub: https://github.com/alibaba/Alink

Background

Our team has long been engaged in the research and development of algorithm platforms, and has benefited from efficient algorithm components and easy-to-use platforms for developers. Looking at the diverse range of the emerging application scenarios of machine learning, in 2017, we started to develop a next-generation machine learning algorithm platform based on Flink. This platform allows data analysts and application developers to easily build end-to-end business processes. We named this project “Alink”, which is an amalgam of several terms closely related to our work: Alibaba, Algorithm, AI, Flink, and Blink.

What is Alink?

Leveraging Flink’s advantage in unifying batch and stream processing, Alink can provide consistent operations for batch and stream processing tasks. In practice, the limitations of FlinkML, Flink’s original machine learning library, have been exposed. It supports only about a dozen algorithm types, and the supported data structures are not universal enough. However, we admire the excellent performance of Flink’s underlying engine, and therefore redesigned and developed the machine learning algorithm library based on Flink. This project was launched within Alibaba in 2018 and has been continuously improved and perfected in complex business scenarios within Alibaba.

Since we first developed Alink, we have been working closely with the community. We have repeatedly introduced our latest progress in the R&D of the machine learning algorithm library and shared our technical experiences at the Flink Forward conference.

As the first machine learning platform in the industry to support both batch algorithms and streaming algorithms, Alink provides a Python API so that developers can easily build algorithm models without the need for Flink technology.

Alink has already been widely used in Alibaba’s core real-time online services, such as search, recommendation, and advertising. During the 2019 Double 11 Shopping Festival, 970 PB of data was processed per day, and up to 2.5 billion pieces of data were processed each second. Alink successfully withstood this massive real-time data training test and helped increase the click-through rate (CTR) by 4% on the Tmall e-commerce platform.

Open-Source

After Blink became open-source, we considered that it would be better to open source Alink’s algorithm library on Flink. However, we found that contributing to a community is actually a complicated thing. Blink already occupied a large amount of bandwidth when we were open sourcing it, and the community has only so much bandwidth. This made it impossible to do multiple things at the same time. The community also needs some time to digest new things. Therefore, we decided to let the community first digest Blink and then gradually introduce Alink to the community. This was a process that could not be overlooked.

The Relationship Between FlinkML and Alink

In the future, we hope that the Alink algorithms will gradually replace the FlinkML algorithms. Alink may become a new generation of FlinkML, but this process will take a long time. In the first half of this year, we actively participated in the design of the new FlinkML APIs and shared our experience in Alink API design. In addition, Alink’s Params and other concepts were adopted by the community. We began to contribute FlinkML code in June and have already submitted more than 40 PRs, including basic algorithm frameworks, basic utility classes, and several algorithm implementations.

Alink contains many machine learning algorithms, and therefore contributing or releasing these algorithms to Flink requires a large amount of bandwidth. We worried that the whole process would take a long time. Therefore, we first released Alink as a separate open-source project so that anyone who needs it could use it. If subsequent contributions go smoothly, Alink can be completely merged into FlinkML and become a major component in the Flink ecosystem. This is the perfect place for Alink. At this point, FlinkML would correspond exactly to Spark ML.

Advantages of Alink over Spark ML

In offline learning, Alink and Spark ML are basically similar. As long as they have done a good job in engineering, offline learning does not involve a generational difference. True generational differences arise from different design concepts. In terms of design, significant generational differences are due to differences in the product format and technical format.

Compared with Spark ML, Alink provides basically consistent batch algorithms, with similar features and performance. Alink supports all the algorithms commonly used by algorithm engineers, including clustering, classification, regression, data analysis, and feature engineering. Before we open sourced Alink, we benchmarked its algorithms against all Spark ML algorithms and achieved 100% consistency. The greatest highlight of Alink lies in streaming algorithms and online learning. Alink is unique in these areas, providing users with significant advantages and no drawbacks.

Main Functions and Advantages

A Rich Library of Efficient Algorithms

As shown in the following figure, each of the open-source algorithm modules provided by Alink contains batch and streaming algorithms. For example, the linear regression module includes linear regression model training based on batch data, linear regression-based prediction that uses streaming data, and linear regression-based prediction that uses batch data.

A Friendly User Experience

The following figure shows the use of PyAlink on Notebook, displaying model training and prediction and the result printing process.

Download and Install PyAlink

For more information about download and installation, see https://github.com/alibaba/Alink#%E5%BF%AB%E9%80%9F%E5%BC%80%E5%A7%8B--pyalink-%E4%BD%BF%E7%94%A8%E4%BB%8B%E7%BB%8D .

Use PyAlink

For more information about PyAlink examples, see https://github.com/alibaba/Alink/tree/master/pyalink .

The following animated examples show how to use PyAlink:

Example 1
Example 2

An Efficient Computing Framework for Iterative Computation

The most important thing in the intermediate function library is the Iterative Communication and Computation Queue (ICQ). This is an iterative communication and computing framework summarized for iterative computation scenarios. It integrates memory cache technology and memory data communication technology. We abstracted each iteration step into a queue composed of a series of ComQueueItems, which are communication and computation modules. This component offers a significant performance improvement over Flink’s basic IterativeDataSet, with the same volume of code and improved readability.

There are two types of ComQueueItems, including computation and communication. ICQ also provides the initialization function to cache DataSet to the memory. The cache formats include Partition and Broadcast. The former caches DataSet shards to the memory, whereas the latter caches the entire DataSet to the memory of each worker. By default, the AllReduce communication model is supported. In addition, ICQ allows you to specify iteration termination conditions.

The code for iterative development of LBFGS algorithms based on ICQ is as follows:

DataSet <Row> model = new IterativeComQueue()
.initWithPartitionedData(OptimVariable.trainData, trainData)
.initWithBroadcastData(OptimVariable.model, coefVec)
.initWithBroadcastData(OptimVariable.objFunc, objFuncSet)
.add(new PreallocateCoefficient(OptimVariable.currentCoef))
.add(new PreallocateCoefficient(OptimVariable.minCoef))
.add(new PreallocateLossCurve(OptimVariable.lossCurve, maxIter))
.add(new PreallocateVector(OptimVariable.dir, new double[] {0.0, OptimVariable.learningRate}))
.add(new PreallocateVector(OptimVariable.grad))
.add(new PreallocateSkyk(OptimVariable.numCorrections))
.add(new CalcGradient())
.add(new AllReduce(OptimVariable.gradAllReduce))
.add(new CalDirection(OptimVariable.numCorrections))
.add(new CalcLosses(OptimMethod.LBFGS, numSearchStep))
.add(new AllReduce(OptimVariable.lossAllReduce))
.add(new UpdateModel(params, OptimVariable.grad, OptimMethod.LBFGS, numSearchStep))
.setCompareCriterionOfNode0(new IterTermination())
.closeWith(new OutputModel())
.setMaxIter(maxIter)
.exec();

Use Cases

Case 1: Sentiment Analysis

Dataset: https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/ChnSentiCorp_htl_all/ChnSentiCorp_htl_all.csv

Data preview:

First, we define a pipeline that contains components for missing value filling, Chinese word segmentation, stopword filtering, text vectorization, and logistic regression.

Next, we use the pipeline defined above to perform model training, batch prediction, and result evaluation.

By using different text vectorization methods and classification models, we can compare the model results in a quick and intuitive manner:

Case 2: FTRL Online Learning

Dataset: https://www.kaggle.com/c/avazu-ctr-prediction/data

Data preview:

First, we build a pipeline for feature engineering, which is composed of two components connected in series: standardization and feature hashing. Through training, we obtain a pipeline model.

Second, we use a logistic regression component for batch training to obtain an initial model.

Next, we use the FTRL training component for online training and the FTRL prediction component for online prediction.

Finally, we use a binary classification evaluation component for online evaluation.

The evaluation results are displayed in Notebook in real time, allowing developers to monitor the model status in real time.

Our Future Plans

Original Source:

Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store