All You Need to Know About MaxCompute
By Sheng Yuan
Users who are new to MaxCompute often find it hard to learn the product quickly and comprehensively when there are so many product documents and community articles to choose from. At the same time, many developers with big data experience hope to map MaxCompute capabilities onto the open-source projects or commercial software they already know, both to judge whether MaxCompute meets their needs and to learn and use it more easily by drawing on their existing experience.
This article describes MaxCompute in different topics from a broader perspective so that readers can quickly find and read their required information about MaxCompute.
MaxCompute is a big data computing service that provides a fast and fully hosted PB-level data warehouse solution, allowing you to analyze and process massive data economically and efficiently.
As indicated by the first part of the MaxCompute definition, MaxCompute is designed to support big data computing. In the meantime, it is a cloud-based service product. The latter part of the definition indicates the application scenarios of MaxCompute: large-scale data warehouses and processing and analysis of large amounts of data.
From the definition alone, we cannot tell what computing capabilities MaxCompute actually provides or how it is delivered as a service. The phrase “data warehouse” in the definition shows that MaxCompute can process large-scale structured data (PB-level data, as described). However, although the definition says MaxCompute can “analyze and process massive data”, it remains to be verified whether it can process unstructured data or provide complex analysis capabilities beyond common SQL analysis.
With these questions in mind, we will continue our MaxCompute introduction and hope that we can find answers to these questions later.
Before introducing the features of MaxCompute, we start with the overall logical architecture to give you an overview of MaxCompute.
MaxCompute provides a cloud-native, multi-tenant service architecture. Computing services and service interfaces are pre-built on top of large-scale underlying computing and storage resources, together with a full set of security controls, development kits, and management tools. MaxCompute works out of the box.
In the Alibaba Cloud console, you can activate the service and create MaxCompute projects in minutes, without activating underlying resources, deploying software, or maintaining infrastructure. Versions are upgraded and problems are fixed automatically by the professional Alibaba Cloud team.
- MaxCompute supports large-scale computing and storage, from the TB level up to the EB level. The same MaxCompute project can serve data scales ranging from a startup team to a unicorn.
- MaxCompute features distributed data storage with multi-copy redundancy. For data storage, only table-level operation interfaces are exposed; file system access interfaces are not provided.
- MaxCompute uses a self-developed, column-oriented storage structure for table data. Data is highly compressed by default. MaxCompute will later be compatible with Ali-ORC, an ORC-based storage format.
- Foreign tables are supported. Data stored in OSS and Table Store can be mapped into foreign tables.
- Storage partitions and buckets are supported.
- The underlying layer is the Apsara Distributed File System developed by Alibaba itself rather than HDFS, although your knowledge of HDFS can help you understand how files are organized under a specific table and how tasks run concurrently.
- Storage and computing are decoupled, so you do not need to add computing resources merely to meet growing storage requirements.
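The partition support listed above is what lets a query avoid scanning a whole table. As a minimal local sketch (all partition specs, values, and names here are invented for illustration), partition pruning works roughly like this:

```python
# Hypothetical sketch of why partitioned storage reduces the data a job touches.
partitions = {
    "ds=20240101": [("user1", 10), ("user2", 7)],
    "ds=20240102": [("user1", 3)],
    "ds=20240103": [("user3", 9)],
}

def scan(partitions, predicate):
    """Read only the partitions whose spec satisfies the predicate,
    mimicking the partition pruning a query engine performs."""
    rows = []
    for spec, data in partitions.items():
        if predicate(spec):
            rows.extend(data)
    return rows

# A query filtering on the partition column touches one partition, not three.
rows = scan(partitions, lambda spec: spec == "ds=20240102")
print(rows)  # [('user1', 3)]
```

A query without a partition filter would still read every partition, which is why partition columns are chosen to match common filter conditions.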
Multiple Computational Models
Note that in traditional data warehouse scenarios, most data analysis tasks are actually done by combining SQL and UDFs. As enterprises attach more importance to data value and more roles begin to use data, enterprises require more computing capabilities to meet needs of different users in different scenarios.
In addition to the SQL data analysis language, MaxCompute supports multiple computational models based on a unified data storage and permission system.
MaxCompute SQL fully supports TPC-DS and is highly compatible with Hive, so developers with a Hive background can get started immediately. Its performance on large-scale data is especially strong.
- A self-developed compiler enables more flexible language feature development, faster iteration, and more flexible and efficient syntax and semantic checks.
- A cost-based optimizer is more intelligent, more powerful, and better suited to complex queries.
- LLVM-based code generation makes the execution process more efficient.
- It supports complex data types (array, map, struct).
- It supports UDF, UDAF, and UDTF in Java and Python.
- Syntax: VALUES, CTE, SEMI JOIN, FROM inversion, subquery operations, set operations (UNION/INTERSECT/MINUS), SELECT TRANSFORM, user-defined types, GROUPING SETS (CUBE/ROLLUP/GROUPING SETS), script running mode, and parameterized views
- Foreign tables are supported (foreign data sources and StorageHandler support unstructured data).
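Among the syntax features above, GROUPING SETS is one of the less familiar ones. The following plain-Python sketch (with invented sample rows) shows what a GROUPING SETS aggregation computes; in MaxCompute SQL the same result would come from a single GROUP BY ... GROUPING SETS query:

```python
# Local illustration of GROUPING SETS semantics: one pass over the data,
# aggregated once per grouping set. Rows and column names are invented.
from collections import defaultdict

rows = [
    {"region": "east", "product": "a", "sales": 10},
    {"region": "east", "product": "b", "sales": 5},
    {"region": "west", "product": "a", "sales": 7},
]

def grouping_sets(rows, sets, value_key):
    """Aggregate the rows once per grouping set, as GROUPING SETS does."""
    results = {}
    for keys in sets:
        agg = defaultdict(int)
        for r in rows:
            group = tuple(r[k] for k in keys)
            agg[group] += r[value_key]
        results[keys] = dict(agg)
    return results

# Equivalent to GROUP BY GROUPING SETS ((region), (product), ()).
out = grouping_sets(rows, [("region",), ("product",), ()], "sales")
print(out[("region",)])  # {('east',): 15, ('west',): 7}
print(out[()])           # {(): 22} -- the grand total
```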
- MapReduce programming interfaces are supported. MaxCompute provides two versions: an optimized, hardened MaxCompute-native MapReduce and an extended MapReduce that is highly compatible with Hadoop.
- The file system is not exposed; both input and output are tables.
- Jobs are submitted by using the MaxCompute client tool or DataWorks.
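The table-in, table-out MapReduce model described above can be illustrated with a small local word count. This is only a plain-Python analogue; the real MaxCompute MapReduce API is a Java SDK:

```python
# Minimal local simulation of the MapReduce model: input and output are rows
# of a "table", not files. All names and data here are illustrative.
from collections import defaultdict

input_table = [("hello world",), ("hello maxcompute",)]

def map_phase(rows):
    # Emit (word, 1) pairs from each input row.
    for (line,) in rows:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sum the counts per key and emit rows for the output table.
    acc = defaultdict(int)
    for key, value in pairs:
        acc[key] += value
    return sorted(acc.items())

output_table = reduce_phase(map_phase(input_table))
print(output_table)  # [('hello', 2), ('maxcompute', 1), ('world', 1)]
```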
- MaxCompute Graph is a processing framework designed for iterative graph computing. Graph computing jobs use graphs to build models. Graphs are composed of vertices and edges with values.
- MaxCompute Graph iteratively edits and evolves graphs to obtain analysis results.
- Typical applications include PageRank, the single-source shortest path algorithm, and the K-means clustering algorithm.
- Developers write graph computing applications with the Java SDK interface provided by MaxCompute Graph and submit jobs with the jar command in the MaxCompute client tool.
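To make "iteratively edits and evolves graphs" concrete, here is a tiny local PageRank iteration on a three-vertex graph. The graph and damping factor are illustrative; real MaxCompute Graph jobs are written against its Java SDK:

```python
# Tiny PageRank example: ranks flow along edges and are re-weighted each
# superstep until they stabilize. Graph and parameters are invented.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 / 3 for v in edges}
damping = 0.85

for _ in range(20):  # superstep-style iterations until ranks stabilize
    contrib = {v: 0.0 for v in edges}
    for src, dsts in edges.items():
        for dst in dsts:
            contrib[dst] += ranks[src] / len(dsts)
    ranks = {v: (1 - damping) / len(edges) + damping * c
             for v, c in contrib.items()}

print(ranks)  # ranks sum to ~1.0, with "c" ranked highest
```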
PyODPS lets you use familiar Python together with the large-scale computing capability of MaxCompute to process MaxCompute data.
PyODPS is the Python SDK for MaxCompute. It also provides a DataFrame framework with Pandas-like syntax, so the powerful processing capabilities of MaxCompute can be applied to ultra-large-scale data.
- PyODPS provides access to ODPS objects such as tables, resources, and functions.
- It submits SQL through run_sql/execute_sql.
- PyODPS allows uploading and downloading data by using open_writer, open_reader or native tunnel APIs.
- PyODPS provides the DataFrame API, which provides interfaces similar to Pandas interfaces and can fully utilize the computing capability of MaxCompute for DataFrame computing.
- PyODPS DataFrame provides many Pandas-like interfaces, but extends the syntax of these interfaces. For example, the MapReduce API is provided to adapt to the big data environment.
- map, apply, and map_reduce make it convenient to write functions on the client and call them; these functions can invoke third-party libraries such as Pandas, SciPy, scikit-learn, and NLTK.
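As a local sketch of the map/apply pattern (column names and data invented; with PyODPS the same function would be shipped to and executed on MaxCompute workers rather than run locally):

```python
# Local analogue of applying a Python function row by row, as the PyODPS
# DataFrame map/apply interfaces do over MaxCompute data.
rows = [{"name": "alice", "score": 0.91},
        {"name": "bob", "score": 0.34}]

def grade(score):
    # Arbitrary client-written Python logic; in PyODPS such functions can
    # also call third-party libraries where they are available.
    return "pass" if score >= 0.6 else "fail"

graded = [dict(r, grade=grade(r["score"])) for r in rows]
print([r["grade"] for r in graded])  # ['pass', 'fail']
```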
MaxCompute provides the “Spark on MaxCompute” solution, which delivers open-source Spark as a computing service on MaxCompute’s unified computing resources and dataset permission system, so users can submit and run Spark jobs in the development style they are familiar with.
- Native Spark jobs in multiple versions: both Spark 1.x and Spark 2.x jobs are supported.
- An open-source-like experience: spark-submit is provided (interactive spark-shell and spark-sql are not currently supported), and the native Spark WebUI is available for viewing job information.
- Access to external data sources such as OSS, Table Store, and databases enables more complex ETL, and unstructured OSS data can be processed.
- Spark can be used to perform machine learning on data inside and outside MaxCompute, expanding application scenarios.
Interactive Analysis (Lightning):
The interactive query service of MaxCompute has the following features:
- Compatibility with PostgreSQL: MaxCompute Lightning provides JDBC and ODBC interfaces that are compatible with the PostgreSQL protocol. Tools or applications based on PostgreSQL databases can easily be connected to MaxCompute projects by using the default driver. It supports connection and access to mainstream BI and SQL client tools such as Tableau, FineBI, Navicat, and SQL Workbench/J.
- Significantly improved query performance: MaxCompute Lightning speeds up queries over moderate volumes of data, returning results in seconds. It supports scenarios such as BI analysis, ad-hoc queries, and online services.
- MaxCompute provides hundreds of built-in machine learning algorithms. Currently, the machine learning capability of MaxCompute is delivered through PAI, which also provides elastic prediction services, support for deep learning frameworks, Notebook development environments, GPU computing resources, and online model deployment. PAI is seamlessly integrated with MaxCompute at the project and data levels.
Table of Comparison
The following comparison is provided to help readers, especially those with open-source community experience, understand the main MaxCompute features.
Frequently Asked Questions
What are the relationship and differences between DataWorks and MaxCompute?
These are two different products. MaxCompute is a computing service for data storage, processing, and analysis. DataWorks is a big data IDE toolkit that integrates features such as data integration, data modeling and debugging, job orchestration and maintenance, metadata management, data quality management, and data API services. Their relationship is similar to that between Spark and HUE. I hope this is an accurate analogy.
I am interested to try MaxCompute. Is it expensive?
No. Actually, the cost is very low. MaxCompute provides pay-by-job billing, under which the cost of a single job depends on the amount of data that job processes. Activate the pay-as-you-go billing plan and create a project; you can then try MaxCompute after creating a table and uploading test data with the MaxCompute client tool (odpscmd) or DataWorks. For small amounts of data, USD 1.50 is enough to try MaxCompute for quite a while.
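As a back-of-envelope illustration of pay-by-job billing: a SQL job's cost scales with the amount of input data it processes. The unit price and complexity factor below are placeholders for illustration, not the actual rates:

```python
# Hypothetical pay-by-job cost estimate. The formula shape
# (input size x complexity x unit price) is illustrative; check the
# official pricing page for real rates.
def sql_job_cost(input_gb, complexity=1.0, unit_price_usd=0.05):
    """Estimate the cost of one pay-as-you-go SQL job (illustrative only)."""
    return input_gb * complexity * unit_price_usd

# A job scanning 2 GB of data at complexity 1:
print(sql_job_cost(2))  # 0.1
```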
MaxCompute also has an exclusive resource model, which is offered on a subscription basis to make costs predictable.
In addition, MaxCompute will soon release the “Developer Edition”, giving developers a certain free quota each month for development and learning.
Currently MaxCompute only exposes tables. Can it process unstructured data?
Yes. Unstructured data can be stored in OSS. You can implement logic that turns unstructured data into structured data by using foreign tables and a custom Extractor. You can also use Spark on MaxCompute to access OSS, extract and transform the files under an OSS directory with a Spark program, and then write the results into MaxCompute tables.
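The kind of logic a custom Extractor implements can be sketched locally: it turns raw lines (for example, from an OSS log file) into structured rows. The log format and field names below are invented; a real Extractor is written in Java against MaxCompute's unstructured-data framework:

```python
# Illustrative extract step: raw unstructured lines in, structured rows out.
raw_lines = [
    "2024-01-01 200 /index.html",
    "2024-01-01 404 /missing",
]

def extract(line):
    """Parse one raw log line into a (date, status, path) row."""
    date, status, path = line.split()
    return date, int(status), path

# The resulting rows are what would land in a structured table.
table_rows = [extract(line) for line in raw_lines]
print(table_rows[1])  # ('2024-01-01', 404, '/missing')
```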
Which data sources can MaxCompute integrate data from?
You can integrate various offline data sources on Alibaba Cloud, such as databases, HDFS, and FTP, by using the DataWorks data integration service or DataX.
You can also batch upload and download data through MaxCompute Tunnel, either with the client commands or through the Tunnel SDK.
Streaming data can be written into DataHub and archived into MaxCompute tables by using the Flume or Logstash plug-in.
Alibaba Cloud SLS and DTS service data can also be written into MaxCompute tables.
This article describes the basic concepts and features of MaxCompute and compares it with common open-source services to help you understand Alibaba Cloud MaxCompute.
Learn more about MaxCompute at the official website: https://www.alibabacloud.com/product/maxcompute