All You Need to Know About MaxCompute

Key Concepts

Architecture

MaxCompute Features

Data Storage

  • MaxCompute supports large-scale computing and storage, from the TB level up to the EB level. A single MaxCompute project can scale with data requirements ranging from those of a startup team to those of a unicorn.
  • MaxCompute stores data in a distributed manner with multi-copy redundancy. For data storage, only table operation interfaces are exposed; file system access interfaces are not provided.
  • MaxCompute uses a self-developed storage structure with column-oriented table storage. Data is highly compressed by default. Compatibility with Ali-ORC, an ORC-based storage format, is planned for a later release.
  • Foreign tables are supported. Data stored in OSS and Table Store can be mapped into foreign tables.
  • Storage partitions and buckets are supported.
  • The underlying layer is the Apsara Distributed File System, developed by Alibaba, rather than HDFS. However, HDFS is a useful mental model for understanding the file structure under a specific table and the task concurrency mechanism.
  • Storage and computing are decoupled, so you do not need to add computing resources just to meet storage requirements.
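
To make the partition and bucket idea above concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not how MaxCompute implements storage: the class `ToyPartitionedTable`, the columns `pt`, `user_id`, and `amount`, and the bucket count are all hypothetical names chosen for illustration. It shows only the general principle that a partition value narrows a read to one slice of the table, and that a stable hash of a key narrows it further to one bucket.

```python
import zlib
from collections import defaultdict
from typing import List, Optional, Tuple

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration

Row = Tuple[str, float]  # (user_id, amount)


def bucket_of(key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a key to a bucket with a stable hash (CRC32), so results are repeatable."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


class ToyPartitionedTable:
    """In-memory stand-in for a table partitioned by `pt` and bucketed by `user_id`."""

    def __init__(self) -> None:
        # Layout: {partition value: {bucket id: [rows]}}
        self.partitions = defaultdict(lambda: defaultdict(list))

    def insert(self, pt: str, user_id: str, amount: float) -> None:
        """Place the row in the partition named by `pt` and the bucket of `user_id`."""
        self.partitions[pt][bucket_of(user_id)].append((user_id, amount))

    def scan(self, pt: str, user_id: Optional[str] = None) -> List[Row]:
        """Read a single partition; when a user_id is given, read only its bucket."""
        buckets = self.partitions.get(pt, {})
        if user_id is None:
            return [row for rows in buckets.values() for row in rows]
        candidates = buckets.get(bucket_of(user_id), [])
        return [row for row in candidates if row[0] == user_id]


if __name__ == "__main__":
    table = ToyPartitionedTable()
    table.insert("20240101", "alice", 10.0)
    table.insert("20240101", "bob", 5.0)
    table.insert("20240102", "alice", 7.5)
    # Only the 20240102 partition (and one bucket inside it) is touched here.
    print(table.scan("20240102", "alice"))
```

In this sketch a query that names a partition never looks at rows from other partitions, which is the reason partition columns are usually chosen to match common filter conditions such as a date.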

Multiple Computational Models

  • The SQL engine uses a self-developed compiler, which allows more flexible language feature development, faster iteration, and more flexible and efficient syntax and semantics checks.
  • It uses a cost-based optimizer that is more intelligent, more powerful, and better suited to complex queries.
  • LLVM-based code generation makes the execution process more efficient.
  • It supports complex data types (array, map, struct).
  • It supports UDF, UDAF, and UDTF in Java and Python.
  • Syntax: VALUES, CTE, SEMI JOIN, FROM inversion, subquery operations, set operations (UNION/INTERSECT/MINUS), SELECT TRANSFORM, User Defined Type, GROUPING SETS (CUBE/ROLLUP/GROUPING SETS), script running modes, and parameterized views.
  • Foreign tables are supported (foreign data sources, plus StorageHandler for unstructured data).
  • MapReduce programming interfaces are supported. MaxCompute provides both its own optimized and enhanced MapReduce and a MapReduce version that is highly compatible with Hadoop.
  • The file system is not exposed and the input and output are all tables.
  • Jobs are submitted by using the MaxCompute client tool and DataWorks.
  • MaxCompute Graph is a processing framework designed for iterative graph computing. Graph computing jobs use graphs to build models. Graphs are composed of vertices and edges with values.
  • MaxCompute Graph iteratively edits and evolves graphs to obtain analysis results.
  • Typical applications include PageRank, the single-source shortest path algorithm, and the K-means clustering algorithm.
  • Graph computing applications are written with the Java SDK provided by MaxCompute Graph, and tasks are submitted with the jar command in the MaxCompute client tool.
  • PyODPS provides access to ODPS objects such as tables, resources, and functions.
  • PyODPS submits SQL through run_sql and execute_sql.
  • PyODPS allows uploading and downloading data by using open_writer, open_reader or native tunnel APIs.
  • PyODPS provides a DataFrame API with Pandas-like interfaces that can fully utilize the computing capability of MaxCompute for DataFrame computing.
  • PyODPS DataFrame extends the syntax of these Pandas-like interfaces. For example, a MapReduce API is provided to suit big data environments.
  • map, apply, and map_reduce make it convenient to write functions on the client and call them. These functions can invoke third-party libraries such as Pandas, SciPy, scikit-learn, and NLTK.
  • Multiple versions of native Spark jobs: Both Spark 1.x and Spark 2.x jobs are supported.
  • An experience consistent with open-source Spark: spark-submit is provided (interactive spark-shell and spark-sql are currently not supported), along with the native Spark web UI for viewing job information.
  • Access to external data sources such as OSS, Table Store, and databases enables more complex ETL. Processing unstructured OSS data is supported.
  • Spark can be used for machine learning on data inside and outside MaxCompute, which expands the application scenarios.
  • Compatibility with PostgreSQL: MaxCompute Lightning provides JDBC and ODBC interfaces that are compatible with the PostgreSQL protocol. Tools or applications based on PostgreSQL databases can easily be connected to MaxCompute projects by using the default driver. It supports connection and access to mainstream BI and SQL client tools such as Tableau, FineBI, Navicat, and SQL Workbench/J.
  • Significantly improved query performance: MaxCompute Lightning improves query performance for data of a certain scale, and query results are available in seconds. It supports scenarios such as BI analysis, ad hoc queries, and online services.
  • MaxCompute provides hundreds of built-in machine learning algorithms. Currently, the machine learning capability of MaxCompute is enabled by PAI, which also provides elastic prediction services that support deep learning frameworks, Notebook development environments, GPU computing resources, and online model deployment. PAI is seamlessly integrated with MaxCompute regarding projects and data.
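
The iterative graph model described above for MaxCompute Graph, with PageRank as a typical application, can be sketched in a few lines of plain Python. This is only an illustration of the vertex-centric idea (in each iteration every vertex sends part of its value along its edges, then updates from what it received). It does not use the MaxCompute Graph Java SDK, and the function name `pagerank`, the damping value, and the iteration count are assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple


def pagerank(
    edges: Iterable[Tuple[str, str]],
    damping: float = 0.85,
    supersteps: int = 50,
) -> Dict[str, float]:
    """Vertex-centric PageRank over a directed edge list.

    In each superstep every vertex splits its current rank evenly across its
    out-edges; each vertex then recomputes its rank from the messages received.
    """
    edges = list(edges)
    vertices = sorted({v for edge in edges for v in edge})
    out_edges = {v: [] for v in vertices}
    for src, dst in edges:
        out_edges[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}

    for _ in range(supersteps):
        inbox = {v: 0.0 for v in vertices}  # messages received this superstep
        dangling = 0.0                      # rank held by vertices with no out-edges
        for v in vertices:
            targets = out_edges[v]
            if targets:
                share = rank[v] / len(targets)
                for dst in targets:
                    inbox[dst] += share
            else:
                dangling += rank[v]
        # Dangling rank is spread evenly so the ranks keep summing to 1.
        rank = {
            v: (1.0 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in vertices
        }
    return rank


if __name__ == "__main__":
    graph = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
    for vertex, score in sorted(pagerank(graph).items()):
        print(vertex, round(score, 4))
```

A distributed graph framework runs the same loop with vertices spread across workers and messages sent over the network; the single-process version here keeps only the logic of one iteration visible.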

Table of Comparison

Frequently Asked Questions

Conclusion


Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com
