New Horizons of the MaxCompute Ecosystem: Rich, Connected, and Integrated
During the session on the big data ecosystem at the 2019 Apsara Conference held in Hangzhou, Li Ruibo, Senior Staff Engineer at Alibaba Cloud, delivered a speech titled “New Horizons of MaxCompute Ecosystem: Rich, Connected, and Integrated.” Focusing on the MaxCompute ecosystem (previously known as ODPS), this article elaborates on three topics: richer and better interface and utilities, connection of all off-premises data, and integration of custom engines. The main content includes the official support for MaxCompute in Tableau, the improved experience provided by MaxCompute Migration Assist (MMA) and command line tools, the progress of Big Data and AI in the Python ecosystem, and the ability to integrate custom engines.
The following introduces highlights from the Li Ruibo’s lecture.
The MaxCompute Ecosystem
MaxCompute is Alibaba’s fully managed flagship big data service. The MaxCompute ecosystem can be viewed from different perspectives. The right half of the following figure presents MaxCompute as a service, which is surrounded by related tools and interfaces such as Java SDKs and JDBC drivers. Together, they constitute an ecosystem. In terms of feature, MaxCompute can be divided into two parts: computing and storage. Data in the lower-left corner of the following figure represents storage, while engines in the upper-left corner represent computing. As an all-in-one big data solution, MaxCompute provides both storage and computing capabilities. The entire circle represents the MaxCompute ecosystem.
Richer and Better Interfaces and Utilities
MaxCompute is a cloud-native big data service. A service is essentially a set of RESTful APIs that interact with services externally through SDKs. This interaction can be divided into two parts. One is control. For example, you can use XML to submit jobs and obtain job statuses. The other is the data tunnel. PB is heavily used when a job needs to be obtained on the server after the job is run. SDKs make use of XML-to-Java object binding technology and have always used the Java Architecture for XML Binding (JAXB) technology that comes with JDK. However, with the iteration of Java, JAXB will exit from the standard Java. The programming features of Java EE require users to reference packages to complete a project, which is very inconvenient. The same is true for Protobuf because cross-version compatibility has always been a problem. When you need to integrate MaxCompute into a specific framework, you may find it difficult to cope with the conflict between dependent libraries. The latest SDK version contains a complete implementation of XML binding, which is completely independent of JAXB and gets ready for Java 9, 10, and 11. This reduces the dependency on Java SDKs. As the foundation for upper-layer tools, Java SDKs can provide benefits such as command-line tools and JDBC.
After working on the Java SDK, the MaxCompute team also made improvements to the JDBC driver. In addition to supporting all the new data types of MaxCompute, Alibaba has also developed a MaxCompute connector for Tableau to connect to data sources based on the latest JDBC driver. Tableau is an industry leader in interactive data analysis and has high standards for the quality of data connectors. The JDBC-based MaxCompute connector has passed Tableau’s TDVT test, with about 700 test cases recorded by Tableau. Starting from Tableau 2019.4, MaxCompute can be found in Tableau’s list of data sources.
In terms of tools, the console has made two significant improvements to the multi-line editing experience and support for 4 Bytes UTF-8. In addition, in 2019, Alibaba developed MMA, a toolchain that can fully express the intention of website migration, as shown in the preceding figure. The procedure starts with data migration, then goes to network condition detection, data source probing, migration plan generation, and ends with execution. This toolchain provides flexible interfaces to meet various special needs in the website migration process. For example, you can change the name if desired. In terms of execution, it supports both push mode and pull mode and allows for synchronous integration with DataWorks data, providing a better user experience. This migration toolkit has been adopted by our clients, including 58daojia.com and baihe.com.
The Python ecosystem is an indispensable part of AI. When we mention the MaxCompute ecosystem, we are talking not just about Python SDKs, but also compatibility with DataFrame. You can use DataFrame-like APIs to perform operations on MaxCompute, which can meet the needs of many scenarios, but not all of them. As shown in the preceding table, almost no index-related operations are supported. Therefore, in 2018, Alibaba open-sourced Mars, a Tensor-based distributed engine to improve the scientific computing of big data. Mars can also serve as an ideal bridge between big data platforms and other systems, such as machine learning and graph computing systems.
The preceding figure shows an example of using Mars, in which Mars serves as the middle layer and a series of new features are used. On the left is a graph computing system, which has its own storage and expression and can generate high-dimensional arrays. It then writes Tensor to the next step based on its own index. Finally, the data enters TenserFlow for deep learning training. During this process, the big data system must be involved to add dimensions.
The preceding figure shows the work of Mars + Intel® Optane™ DC Persistent Memory (DCPM) under certain workloads. The sizes of input data are 250 GB and 500 GB respectively. The workload size is calculated by multiplying two matrixes. Therefore, the scale of the problem grows as the square of the matrix size. This means that, when the matrix size doubles, the computing scale increases four-fold. With DCPM, performance will improve as the problem scale increases. For instance, in a 500 GB scenario, the performance can be improved by 25%.
Connection of All Off-premises Data
Inside Alibaba, 99% of the structured data is stored in MaxCompute, which supports exabyte-level storage. Some users in the public cloud may have both structured data and unstructured data scenarios, such as OSS and OTS shown in the figure. MaxCompute has long supported external tables, but only in OSS and OTS. This is due to the network infrastructure. MaxCompute, OSS, and OTS are all basic cloud-native services that can be accessed and connected over the network anytime and anywhere. It is the services in the VPC that have been missing for a long time. These are shown on the right side of the preceding figure. These services are not cloud-native and cannot be accessed over the public network. However, as the need to access these services grows, MaxCompute has already begun to support VPC data access in the form of external tables, Spark, Flink, and SQL UDF.
The preceding figure shows an example of Spark accessing an ApsaraDB for RDS (RDS) instance. It takes two steps to access an RDS instance in VPC from Spark on MaxCompute. First, add a whitelist for this RDS instance. The whitelist of IP addresses that MaxCompute can access is shown on the left. Second, set the necessary parameters required in the VPC before starting the Spark job, as shown on the right.
The VPC connectivity enables more application scenarios, such as directly reading real-time data streams from Kafka, federated computing of dimension tables in RDS instances, and using ApsaraDB for Redis (Redis) to boost engine performance. When the network and bandwidth of the VPC connection are sufficiently stable, Alibaba may be able to implement a hybrid cloud architecture.
Integrated Custom Engines
The preceding figure shows the federated computing platform of MaxCompute. The computing capabilities of MaxCompute can be integrated into various open-source engines. This allows for unified management and the use of unified resource pools, enabling jobs written by various open-source engines to run on MaxCompute.
In April 2019, Alibaba officially launched Spark on the public cloud and ran pilot programs in several regions inside China, which were well-received by users. Now, MaxCompute is available out-of-the-box in all regions. Initially, MaxCompute was an offline data system specializing in running large-scale jobs. Now it runs jobs as a federated computing platform and has introduced Spark and Flink to support stream computing.
At first, to enable MaxCompute Spark to run on MaxCompute, we made some modifications to the native Spark. Some users asked whether their modified Spark engines would be compatible with ours. Therefore, we wanted to ensure that users would not have to modify the core code of their open-source engines. This means that user-modified open-source engines could directly run on Alibaba’s platform without any further modification. As shown in the preceding figure, Spark is segmented into several parts, including the engine core, the part where the engine and the MaxCompute data source connect, and the part where the engine and the resources connect. For data, we abstracted out a set of Cupid SDKs to make the development of components to connect to the Alibaba platform easier for our users. In terms of resources, an in-place replacement of YarnClient is provided to connect to the scheduling interface of Alibaba’s Apsara distributed operating system. Currently, the native JAR packages of Spark 2.4 and Flink 1.9 can be directly used without any modifications.
Therefore, by providing the capability of customizing engines, MaxCompute allows you to download a native or custom engine package, create a custom engine, and then submit jobs to the custom engine.