By Qu Ning, Intelligent Product Expert at Alibaba Cloud
Enterprises today face many challenges in the construction and operation of data technology platforms during digital transformation. The multi-functionality and service-oriented evolution of modern data warehouses makes it possible to solve these challenges. As a data warehouse product in the Apsara Big Data Platform of Alibaba Cloud, MaxCompute has been widely used and trusted by Alibaba and other customers.
This article describes the core capabilities and advantages of Alibaba Cloud MaxCompute. As a unified analysis-oriented data platform, MaxCompute provides enterprises with key business values such as business agility and lower total cost of ownership (TCO) in typical big data analysis platform scenarios. The following content is based on the video of a speech by Qu Ning, Intelligent Product Expert at Alibaba Cloud, and quotes materials from the related PowerPoint slides.
This lecture covers the following topics:
- Apsara Big Data Platform solution
- MaxCompute: SaaS cloud-based enterprise data warehouse
- Value of MaxCompute
- MaxCompute-based solutions and cases
1. Apsara Big Data Platform Solution
Challenges in Enterprise Digital Transformations
Enterprises face many challenges in the process of digital transformation. First, enterprises tend to build data platforms application by application to meet the business needs of different departments. These platforms are not globally planned, which results in many data silos. No unified enterprise-grade data access is available across these silos and data sharing costs are high, which makes self-service analytics difficult.
Second, modern enterprises require their business teams to have agile innovation capabilities oriented to rapid business changes, data intelligence R&D capabilities, and user experience-oriented scenario innovation capabilities. This all places higher requirements on IT systems. When an enterprise tries to meet the data analysis requirements of different teams and the development and use requirements of internal users, the development efficiency is low, often taking several months from product planning to final launch.
Third, enterprises used to implement strong control over technology platforms by using on-premises big data platforms. However, in actual Internet data center (IDC) operation and management, O&M costs, IT talent costs, and staff turnover were high, which resulted in low service quality for on-premises platforms. Enterprises often plan a fixed amount of resources based on expected business requirements, which results in low resource utilization on ordinary days and insufficient resources during business peaks. On-premises platforms are often built on open-source components, which may introduce system security, data security, and compliance risks.
Modern Big Data Platforms
The multi-functionality and service-oriented evolution of modern data warehouses makes it possible to solve the challenges enterprises face in their digital transformations. Modern big data platforms are evolving in two directions. First, with the rise of cloud computing, modern big data platforms are evolving toward SaaS and provide computing resources on demand. Second, traditional data warehouses cannot meet the needs of modern big data. Real-time data warehouses must be built to analyze unstructured data at low costs and explore deeper value by using artificial intelligence (AI) capabilities.
Apsara Big Data Platform Solution
The big data platform solution is used to create various data applications on top of different product portfolios. The Apsara Big Data Platform solution is applicable to data-driven operations in Internet industries like e-commerce, gaming, and social networking, including intelligent recommendation, log analysis, business operation analysis, user profiling, data governance, business dashboards, and search. This big data platform incorporates Alibaba Cloud best practices. It adopts advanced technologies, reduces costs, improves efficiency, and supports high value-added businesses. As a flagship product in the Apsara Big Data Platform solution, MaxCompute plays a core role.
2. MaxCompute: SaaS Cloud-based Enterprise Data Warehouse
Benefits of MaxCompute
MaxCompute is positioned as a SaaS cloud-based enterprise data warehouse. It is hosted in Alibaba Cloud as a large-scale resource pool. Alibaba Cloud deploys and manages the resource pool and exposes it through API operations, and you access it from a client to run queries and other operations. You do not need to provision resources before you use MaxCompute. The ultra-large resource pool features on-demand usage and high elasticity. MaxCompute uses a storage-computing separation architecture to provide structured storage and on-demand computing resources, which gives good scalability at a low cost.
MaxCompute supports a variety of service-oriented scenarios. In the marketing data analysis scenario, it collects and analyzes user behavior data, creates profiles, and adds tags to provide more services. MaxCompute also collects and queries online operation information in real time and adjusts operation policies. It creates data warehouses for each industry to build more data applications.
MaxCompute Technical Features
- MaxCompute is a fully managed serverless online service. You do not have to provision or manage resources, and you can use almost unlimited computing resources. Alibaba Cloud handles version upgrades, resource scaling, and troubleshooting, which further reduces O&M investment.
- MaxCompute provides optimal elasticity and scalability. Due to the separation of storage and computing, MaxCompute supports data scale-out from terabytes to exabytes. This allows enterprises to store all their data assets on a single platform for associated analysis and eliminate data silos. Serverless resources are allocated based on changes in peak and off-peak hours, which allows for automatic scale-out. MaxCompute provides powerful computing capabilities. A single job can obtain thousands of cores in seconds. MaxCompute can process exabytes of data.
- MaxCompute integrates data exploration capability. It is deeply integrated with Alibaba Cloud warehouses. By default, MaxCompute integrates access analysis for data lakes such as Object Storage Service (OSS) and can process unstructured or open format data. MaxCompute also supports external table mapping and direct Spark access for data lake analysis. Based on the mappings between data warehouses and external tables, MaxCompute implements data lake analysis and association analysis of data warehouses by using the same data warehouse service and user interface.
- Traditional business intelligence (BI) capabilities can no longer meet business needs. More enterprises need AI capabilities to integrate data into platforms to support more scenarios. MaxCompute is seamlessly integrated with Machine Learning Platform for Artificial Intelligence to support the integration of BI and AI and provide powerful machine learning capabilities. You can use Spark-ML for intelligent analysis and use third-party machine learning libraries of Python.
- Real-time analysis is a hot topic. MaxCompute supports real-time streaming data writes (through Tunnel) and analysis in data warehouses. It is deeply integrated with major cloud streaming services to ingest streaming data from various sources. MaxCompute supports high-performance, elastic concurrent queries that return within seconds, which meets the requirements of near-real-time analysis scenarios.
- MaxCompute supports multiple computing engines and provides complete Spark features through the built-in Apache Spark engine, which is deeply integrated with the computing resources, data, and permission system of MaxCompute.
- MaxCompute provides unified and diverse computing capabilities. It provides you with many offline computing modes, such as MapReduce (MR), directed acyclic graph (DAG), SQL, machine learning, and graph computing, as well as many real-time computing modes, such as stream processing, in-memory computing, and iterative computing. It covers common relational big data, machine learning, unstructured data processing, and graph computing.
- Data mid-ends often need to share data. Everyone in an enterprise should be able to search the data assets of the enterprise and know what data is available. In addition, with security and compliance permission controls in place, everyone should be able to obtain the data assets for further development. In this case, a unified metadata view must be provided by a data mid-end. MaxCompute provides unified tenant-level metadata to allow an enterprise to obtain the complete enterprise data catalog. It also establishes connections between data warehouses and external data sources by using external tables. In this way, the data mid-end can provide a unified data view without collecting all data, which meets data sharing requirements.
- MaxCompute is a complete service rather than a simple computing engine. It commits to 99.9% service availability under its service level agreement (SLA), supports self-service and automated O&M, and provides comprehensive fault tolerance for software, hardware, network, and human errors.
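To make the 99.9% availability figure in the last feature more concrete, the following minimal Python sketch converts an availability percentage into the downtime it allows over a period. The function name and the 30-day period are illustrative assumptions and are not taken from the MaxCompute SLA terms.

```python
def max_downtime_minutes(availability_pct: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in minutes, implied by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


# Assumption: a 30-day billing month; real SLA terms may define the period differently.
print(f"{max_downtime_minutes(99.9):.1f} minutes of downtime per 30-day month")
```

Under these assumptions, 99.9% availability corresponds to a budget of roughly 43 minutes of downtime in a 30-day month.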
3. The Value of MaxCompute
Cloud-native Application Scenarios
Serverless has become a trend in data platforms. Serverless resources are allocated on demand, and their high scalability provides an optimal solution for data mid-end problems. MaxCompute is a serverless cloud-native data warehouse service that provides an ultra-large-scale resource pool. The resource pool is transparent to you, and you only need to activate projects, build data warehouses, perform data modeling, and analyze data in the projects at the logic layer. This agile service model significantly lowers the barrier to using a data platform and shortens the data processing period from months to days, which accelerates value creation.
For example, the time from activating MaxCompute to running the first SQL query statement based on a public dataset is only 2 minutes.
Simply log on to the DataWorks console, and in the left-side navigation pane, click Workspaces. On the page that appears, click Create Workspace. In the Create Workspace wizard, set Workspace Name and click Next in the Basic Settings step. In the Select Engines and Services step, select Pay-As-You-Go next to MaxCompute and click Next. In the Engine Details step, set Instance display name, MaxCompute Data Type Edition, and Account for Accessing MaxCompute, and click Create Workspace.
After the workspace is created, go to the DataWorks data development portal and run your first SQL statement. MaxCompute provides public datasets that you can query. It takes only two minutes from activating the service to running the first SQL query. Serverless means more agile business responses and faster innovation through trial and error.
This mode allows start-ups to walk through their business scenarios and verify business value at low costs. MaxCompute also supports agile scenarios for new organizations and departments in large enterprises to perform novel development in independent environments.
Serverless provides simple and powerful computing to adapt to fast-changing business demands, without the need for capacity planning. The left side of the following figure shows that a lot of resources are required to run a complex job. MaxCompute supports data of different scales and provides powerful computing capabilities.
Serverless implements on-demand resource allocation, eliminates the need for resource scaling in clusters or queues, and dynamically allocates appropriate resources for each job. This eliminates the need for capacity planning and avoids a mismatch between resource capacity and business needs.
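The mismatch between fixed capacity and fluctuating demand can be illustrated with a toy model. The sketch below is not how MaxCompute schedules resources. It only compares a fixed-size cluster with an allocation that follows demand, and the demand figures and the `allocation_gap` helper are hypothetical.

```python
def allocation_gap(demand, allocated):
    """Return (idle, unmet) resource units summed over all time slots."""
    idle = sum(max(a - d, 0) for d, a in zip(demand, allocated))
    unmet = sum(max(d - a, 0) for d, a in zip(demand, allocated))
    return idle, unmet


# Hypothetical demand in cores over six time slots, with one business peak.
demand = [100, 120, 80, 900, 150, 100]

# Fixed cluster sized between average and peak demand (an assumed planning choice).
fixed = [300] * len(demand)

print("fixed capacity (idle, unmet):", allocation_gap(demand, fixed))
print("on-demand      (idle, unmet):", allocation_gap(demand, demand))
```

In this toy model the fixed cluster is mostly idle outside the peak and still falls short during it, whereas an allocation that tracks demand has neither idle nor unmet units.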
However, not all jobs require maximum performance, and you must balance cost and performance based on the enterprise, its stage, and the task type. Different enterprises have different computing power requirements and preferences. A start-up does not have much data, so its costs are low. However, as data volume and the number of users grow, costs can become very high. At this point, MaxCompute can provide elastic computing power for you to use on demand. MaxCompute also provides subscription-based plans to meet routine business needs and keep expenses stable.
When the enterprise has a stable business scale, you can purchase such plans and prioritize jobs to ensure the stable output of key tasks. You can also purchase storage and computing packages. Ad hoc queries are not periodic, but they can have very high requirements for computing power. By combining multiple types of computing resources, MaxCompute integrates the subscription-based and pay-as-you-go pricing modes to balance costs and performance. MaxCompute can also preempt and use idle computing resources, which are priced 74% lower than subscription-based computing resources.
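The trade-off described above can be sketched with a small cost model. All unit prices and demand values below are hypothetical placeholders, not MaxCompute list prices. Only the 74% figure comes from the text, and the helper names are illustrative.

```python
def discounted_price(base_price: float, discount_pct: float) -> float:
    """Return the price after applying a percentage discount."""
    return base_price * (1 - discount_pct / 100)


def hybrid_cost(demand, reserved_units, reserved_unit_price, on_demand_unit_price):
    """Cost of a reserved baseline plus pay-as-you-go for demand above the baseline."""
    reserved_cost = reserved_units * reserved_unit_price * len(demand)
    overflow = sum(max(d - reserved_units, 0) for d in demand)
    return reserved_cost + overflow * on_demand_unit_price


# Hypothetical demand per time slot: a stable baseline with one ad hoc burst.
demand = [10, 10, 50, 10]

print("reserve for peak:  ", hybrid_cost(demand, 50, 1.0, 2.0))
print("pay-as-you-go only:", hybrid_cost(demand, 0, 1.0, 2.0))
print("baseline + burst:  ", hybrid_cost(demand, 10, 1.0, 2.0))

# Idle resources priced 74% below an assumed subscription unit price of 1.0.
print("idle unit price:   ", round(discounted_price(1.0, 74), 2))
```

With these assumed numbers, reserving only the stable baseline and paying on demand for the burst is cheaper than either extreme, which is the balance the mixed pricing model aims for.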
MaxCompute has intrinsic multitenancy support. It ensures isolation among tenants and implements data sharing across businesses and organizations through fine-grained permission control. Different organizations and departments in an enterprise centralize data in the resource pool to implement a unified and complete data asset view. MaxCompute supports cross-project data access authorization. It allows you to efficiently share data within an enterprise at low cost and implement permission control for enterprise data resources for individual users. MaxCompute provides the industry’s most complete security management system. It supports cross-project data security management, fine-grained access control, data encryption, privacy protection, and behavior audit capabilities.
The multitenancy system offers many advantages and requires more security management capabilities. Security incidents occur frequently. How can cloud-based big data services effectively protect the data and services of enterprises? MaxCompute was born in a serverless and multitenancy environment. Alibaba Cloud has built numerous security management mechanisms in MaxCompute, which provide comprehensive and multi-level security management capabilities and continuously protect cloud-based data services. MaxCompute ensures the security of infrastructure management. It also provides many management functions for access control and authorization, data security, risk control, and multitenancy security isolation. Specifically, MaxCompute provides features such as data encryption, real-time audit, and backup and recovery.
For example, abnormal user behaviors are audited in real time and data is automatically backed up and restored.
Suppose that the tab_dev table has been deleted. In this case, you can check who deleted the table on the historical event query page, as shown in the following figure.
In the following figure, you can find out who deleted the tab_dev table and when and how it was deleted for subsequent tracing. MaxCompute provides real-time audit capabilities.
Continuous Backup and Recovery
After data is lost, you must retrieve important data. MaxCompute provides automatic backup capabilities at the service level. You can run the restore command to retrieve the lost data, as shown in the following figure.
Unified Analysis-oriented Data Platform
MaxCompute is a unified analysis-oriented data platform, which meets data needs, simplifies the data platform architecture, and accelerates the acquisition of in-depth business insights. MaxCompute provides more real-time data insight capabilities. It integrates message services at the product level, collects custom Datahub logs, and obtains and analyzes events in real time. Enterprise data comes in many formats and is not always stored in one place. MaxCompute provides the federated query capability and can interact with database systems, so data loaded from different databases is processed on the same data processing platform. MaxCompute also seamlessly integrates with Machine Learning Platform for Artificial Intelligence and provides built-in support for mainstream machine learning frameworks, which lets you acquire in-depth insights without moving data.
MaxCompute is fully integrated with Spark to support multiple engines for a single set of data, allowing you to use mainstream and familiar computing engines on fully-managed unified data platforms in more computing scenarios. Many users prefer the Spark engine. MaxCompute Spark is the Apache Spark computing framework provided by MaxCompute. It is fully compatible with Spark’s APIs, applications, and ecosystem tools, which share the same data storage, computing resources, and database management system. MaxCompute users can use Spark for application development in a unified data storage and permission control system.
Modern Data Warehouse and Data Lake Solution
A data lake is a center for data storage and exchange in the cloud. For example, a large amount of unstructured data is stored in OSS. MaxCompute can connect to the data lake by using external tables or Spark, run federated queries on external tables, and load the data in OSS into the data warehouse by using the load command. MaxCompute connects to the various external data sources of an enterprise and uses a unified data management system (the unified metadata management system of MaxCompute) to seamlessly access and process data from multiple sources across isolated storage systems in a unified computing environment.
Integrated Data Platform for BI and AI
Big data platforms provide data for data preprocessing in real AI scenarios. How can we conduct both BI analysis and AI analysis to fully explore the value of data within the unified data asset and security systems of an enterprise? Ideally, computing and data are decoupled, and AI analysis runs on the big data in the platform without moving any data. MaxCompute is integrated with Machine Learning Platform for Artificial Intelligence to process data and support intelligent analysis.
Technical Base of High Performance and Low Costs
Data platforms have always focused on performance, costs, and efficiency. On-premises platform costs include one-time hardware and software costs, scaling costs, management costs, and O&M costs. MaxCompute costs include cloud service costs and very low system management costs. Therefore, MaxCompute can greatly reduce costs in the early stage and allows quick value verification. MaxCompute also delivers excellent performance for data volumes of 30 TB and 100 TB. Compared with an on-premises Hadoop system, MaxCompute reduces costs by half and doubles performance. As a result, it has been praised by the Transaction Processing Performance Council (TPC). MaxCompute provides high-performance and cost-effective big data analysis services. Compared with IDC platforms, MaxCompute reduces real TCO by one-third.
MaxCompute provides strong computing power in offline warehouses and accelerates elastic concurrent queries in BI and integrated analysis scenarios. As shown in the preceding figure, MaxCompute performs well in various test sets.
Open Ecosystem Data Platform
MaxCompute is not a standalone technology platform. It requires support from peripheral systems and must be integrated with enterprise environments. MaxCompute provides an open ecosystem and can be integrated with multiple services, including open first-party service interfaces such as the MaxCompute Studio IDE, JDBC, SDK, open-source Spark Connector, open-source Kafka Connector, and MaxCompute Migration Assist (MMA). In addition, MaxCompute can be used in an enterprise’s existing IT environments and is deeply integrated with Alibaba Cloud services, such as DataWorks, Machine Learning Platform for Artificial Intelligence, Datahub, Data Transmission Service (DTS), Log Service (SLS), and Message Queue for Apache Kafka, to minimize data link integration work. MaxCompute also integrates with many third-party tools, such as Tableau, R, Python, and Python SDK, to improve development efficiency. Alibaba Cloud provides a complete set of big data solutions, including data collection, real-time and offline integrated computing, and data application presentation. MaxCompute can be used as the foundation for data warehouses to support fast integration with multiple Alibaba Cloud services, meeting the intelligent application requirements of enterprises.
Enterprise-grade Data Governance Platform
When an enterprise grows to a large scale, it will inevitably encounter data governance problems. Data growth is not directly related to business growth. Data grows exponentially, whereas businesses grow more smoothly. The cost of data governance increases after data platforms are interconnected. This solution provides Alibaba best practices for big data governance and data discovery mechanisms. It supports unified metadata collection, data asset catalog construction, data detection and analysis, federated query, and resource optimization. This allows enterprises to detect the value of data, effectively manage metadata, safely produce data, and intelligently minimize data costs. Take data storage as an example. A lot of data is stored in data warehouses but cannot be used. Some jobs do not perform computing, whereas other jobs perform repeated computing. Alibaba automatically collects first-hand data by using engines to provide an optimized view in multiple fields and from multiple perspectives.
4. MaxCompute-based Solutions and Cases
MaxCompute Resolves Pain Points for On-premises Data Platforms
On-premises platforms face many challenges, including high construction costs, insufficient scalability and elasticity, low resource utilization, and high O&M costs. Off-premises MaxCompute services can solve most of these pain points. Therefore, MMA-based cloud migration has become a mature solution, including data migration evaluation, data migration, and job migration and conversion. MMA allows enterprises to use the features and advantages of off-premises MaxCompute services by migrating their platforms to the cloud.
Solution for Migrating Big Data to the Cloud
Migrating a big data platform to the cloud involves integrating with and upgrading to the cloud ecosystem, and Alibaba Cloud provides a complete Apsara Big Data Platform solution for this purpose. A leading customer in the mother and infant industry encountered many difficulties in building data platforms, including high cluster utilization, poor performance, and an urgent need for comprehensive big data governance. In addition, the customer hoped to reduce the annual big data costs of its IDCs. By using the Alibaba Cloud solution for efficiently migrating big data platforms to the cloud at lower cost, the customer migrated to MaxCompute, Realtime Compute, and DataWorks. This improved the performance of some tasks by more than 10 times. The open-source data format was converted, and the storage footprint was reduced from the 3 PB used on the on-premises Hadoop system to 900 TB. The real-time data processing capabilities of Apache Flink were used to turn the customer's scenarios into real-time scenarios (obtaining real-time behavior based on user ID dimensions and content types, obtaining the real-time group chart IDs of users, and obtaining real-time publication information of articles). This allowed the customer to implement real-time recommendations to increase conversion rates. The overall costs of big data platforms were reduced by more than 30%.
Solution for Intelligent Real-time Data Warehouses
This solution is applicable to real-time queries on large-scale data in Internet industries such as e-commerce, gaming, and social networking. Based on this solution, Alibaba Cloud real-time data warehouses are seamlessly integrated with offline data warehouses in an end-to-end manner. This solution can provide a cost-effective combination of real-time computing and offline computing on one storage system. For example, a video enterprise develops tag data for target users, gains real-time insights into user profiles, and implements real-time video recommendations based on the combination of MaxCompute, Realtime Compute, and Hologres. MaxCompute’s out-of-the-box (OOTB) availability, complete ecosystem, robust performance, and elastic resources allow customers to balance their cost and elasticity needs. MaxCompute supports data layering, fraud prevention, computing optimization, and storage optimization.
MaxCompute supports the pay-as-you-go billing method. We recommend that you select the pay-as-you-go billing method initially to match resources to your service needs. You can activate MaxCompute and will not be charged if you do not use it. When your business matures, we recommend that you select the subscription billing method. You can use this billing method to reduce the unit price, enjoy more discounts, and control relevant budgets and costs.
You can still enjoy outstanding elastic computing resources in pay-as-you-go mode. The resource pool is shared. Jobs obtain computing resources on demand on a preemptive basis, and you cannot reserve a specific amount of resources or set a limit. Only SQL, MR, Spark, and Interactive Analytics computing tasks are billed. For storage, you are billed only for the resources used to store tables, based on the compressed data volume. Storage resources are shared, automatically scaled, and unlimited. MaxCompute tables and resources consume storage resources. Data upload to MaxCompute is free. You are billed for downloading data from MaxCompute over the Internet in pay-as-you-go billing mode. Computing resources of the subscription-based standard edition include computing units and non-reserved computing resources. Subscription-based plan editions are billed for computing and storage together. If you purchase a plan, you do not need to pay separately for storage resources.
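The recommendation to start with pay-as-you-go and move to subscription as the business matures amounts to a break-even comparison. The sketch below is a simplified model with hypothetical prices. It assumes a flat monthly subscription fee that covers all usage and a single pay-as-you-go unit price, which ignores the capacity limits and tiering of real plans.

```python
def break_even_usage(pay_as_you_go_unit_price: float, subscription_monthly_fee: float) -> float:
    """Monthly usage at which pay-as-you-go cost equals the subscription fee."""
    if pay_as_you_go_unit_price <= 0:
        raise ValueError("pay_as_you_go_unit_price must be positive")
    return subscription_monthly_fee / pay_as_you_go_unit_price


def cheaper_billing_method(monthly_usage: float,
                           pay_as_you_go_unit_price: float,
                           subscription_monthly_fee: float) -> str:
    """Return the cheaper billing method for a given monthly usage."""
    pay_as_you_go_cost = monthly_usage * pay_as_you_go_unit_price
    if pay_as_you_go_cost < subscription_monthly_fee:
        return "pay-as-you-go"
    return "subscription"


# Hypothetical prices: 0.5 per usage unit on demand, 1000 per month on subscription.
print("break-even usage:", break_even_usage(0.5, 1000))
print("early stage:     ", cheaper_billing_method(500, 0.5, 1000))
print("mature business: ", cheaper_billing_method(5000, 0.5, 1000))
```

Below the break-even usage, pay-as-you-go is cheaper. Above it, the subscription fee is the lower and more predictable cost under these assumptions.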
As a data warehouse product in the Apsara Big Data Platform, MaxCompute has been widely used and trusted by Alibaba and other customers. In addition, MaxCompute can meet the various digital needs of modern enterprises. By using MaxCompute, enterprises can build high-performance and agile data platforms at low costs. MaxCompute has an ultra-large data storage capacity. It integrates the multi-source data of an enterprise to unify data assets and enable each employee of the enterprise to use and analyze data in a secure and shared environment. This empowers data-driven organizational transformation. MaxCompute is an ideal technical foundation for data warehouses and data mid-ends.
For more information about MaxCompute, visit https://www.alibabacloud.com/product/maxcompute