By Yun Lang, Senior Product Expert at Alibaba Cloud
How do we define measurement for big data? How do we evaluate computing power for a computing platform?
In recent years, we have worked with several international standardization organizations to run benchmark tests on MaxCompute: first Sort Benchmark, TPC-H, and TPC-DS, and then TPCx-BB (BigBench) last year. We found that TPCx-BB is the standard for measuring the computing power of big data platforms. A TPCx-BB test must take into account all types of data: structured, semi-structured, and unstructured.
It must also cover the three basic analysis scenarios: common modeling such as static modeling, data analysis and mining, and reporting, because structured query language (SQL) is written differently for reporting than for static analysis. The types of queries include pure SQL, machine learning, natural language processing (NLP), standard language, and stream computing. A TPCx-BB test on big data must cover all of these data types, job types, and techniques.
TPCx-BB Benchmark Test
This test contains 30 test cases that together cover the different data types, techniques, and query types. You can download these 30 test cases to learn how to run a TPCx-BB test. Before TPCx-BB, we ran TPC-DS tests. TPC-DS extended our benchmark scenarios from online transaction processing (OLTP) to online analytical processing (OLAP). TPCx-BB extends them further, from OLAP to the full range of big data features, such as log files and machine learning. These 30 test scenarios cover all of our requirements for big data computing. Most importantly, they reflect real scenarios as closely as possible, so the test results are closer to the performance you actually see in production. In practice, PB- and EB-level datasets are common. Currently, the official website supports testing datasets of 1 TB, 3 TB, 10 TB, 30 TB, and up to 100 TB.
At the Computing Conference held in Beijing in 2017, Alibaba Cloud wondered whether MaxCompute could be benchmarked live, the way mobile phones are. Alibaba Cloud then ran an onsite benchmark test on clusters in Beijing and Shenzhen. The test on the Beijing clusters produced good results in three respects, as shown in the following figure. First, the benchmark used the maximum dataset size, 100 TB, for the first time. Second, in the number of queries that can be run per minute (QPM), MaxCompute obtained the world’s highest score of 8,200 QPM. Third, the cost was the lowest, at only USD 354.7 per QPM. MaxCompute therefore made a breakthrough in data capacity, performance, and price-performance ratio all at once. We have found the standard for measuring big data. In the meantime, we keep optimizing our products and expect to make more breakthroughs.
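To make the relationship between the two reported figures concrete, here is a minimal sketch in Python. It assumes that the price-performance figure is the total system cost divided by the QPM score, which is how TPC-style price-performance metrics are usually defined. The function names are hypothetical, and the implied total cost is only derived under that assumption; it is not a figure stated in the test results.

```python
def price_per_qpm(total_cost_usd: float, qpm: float) -> float:
    """Price-performance, assuming it is total system cost divided by QPM."""
    if qpm <= 0:
        raise ValueError("qpm must be positive")
    return total_cost_usd / qpm


def implied_total_cost(price_per_qpm_usd: float, qpm: float) -> float:
    """Invert the ratio: the total cost implied by a price-performance figure."""
    return price_per_qpm_usd * qpm


# Figures quoted in the text: 8,200 QPM at USD 354.7 per QPM.
qpm_score = 8200.0
usd_per_qpm = 354.7

# Derived value under the stated assumption, not a reported result.
total_cost = implied_total_cost(usd_per_qpm, qpm_score)
print(f"Implied total system cost: USD {total_cost:,.0f}")
print(f"Round trip: USD {price_per_qpm(total_cost, qpm_score):.1f} per QPM")
```

Because the metric is a ratio, a lower price per QPM can come from a cheaper system, a higher score, or both.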
Evolution of MaxCompute 2.0
In fact, MaxCompute has been developing and evolving inside Alibaba for ten years. At first, it mainly served the internal business of Alibaba Group. It then evolved to replace Hadoop, and finally became a true data computing middleware and infrastructure. Within Alibaba, 99% of stored data and 95% of computing run on MaxCompute.
Let’s see how MaxCompute has evolved. The following figure shows our recent optimization priorities. What you may not see there is the index support that we have been developing recently. As MaxCompute is used more intensively, indexes will help significantly with performance, cost, full-disk scanning, and other aspects. We are also working on compatibility between our data organizational structures and the Optimized Row Columnar (ORC) format. Different data organizational structures strike different balances between storage compression ratio and performance, and compatibility with ORC has to respect that trade-off. We will select the best data organizational structure and show you its advantages as soon as possible.
Ultimately, you will see MaxCompute become cheaper. With regard to languages, we have comprehensively optimized and upgraded NewSQL and broadened the coverage of the optimizers and languages. It has been almost a year since MaxCompute 2.0 went live, and the rollout was a long process. We turned off the trial switch in May 2018 and then opened MaxCompute 2.0 to the entire network.
The following figure shows the types of jobs supported on MaxCompute. Since last year, we have been thinking about how to integrate MaxCompute jobs with the ecosystem. That is to say, if there is only one copy of data, how does MaxCompute support more types of tasks, without migrating or copying data in the process?
Data is shared between projects, and on that basis we can integrate multiple computing platforms to support more job types. We can announce in advance that we will support real-time interactive analysis jobs: when the tables involved hold a small amount of data, results come back within seconds. This feature is called “Lightning.” We will also support Spark jobs, in which Spark accesses MaxCompute tables directly, without importing, exporting, or migrating data. In this way, we achieve the purpose of a joint computing platform: to support more job types and meet the requirements of more computing scenarios on top of unified data.
New SQL Features in MaxCompute 2.0
SQL optimization includes the optimization of compilers and support for complex data types. The overall work focuses on improving usability and development efficiency, improving compatibility, and reducing migration costs, to optimize SQL from a developer’s perspective.
SQL has a unique advantage: it is a declarative language, so the code is easy for everyone to understand. You can tell what a statement is meant to solve without stepping through how it is executed. However, a declarative language lacks some flexibility. For this reason, functions written in languages such as Java or Python can be used in combination with SQL. Combining SQL with functions keeps the simplicity of declarative statements while supporting complex business logic. MaxCompute 2.0 also introduces more built-in functions to make complex business logic easier to express.
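The idea of pairing declarative SQL with a procedural function can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for MaxCompute, and the table, column, and function names (users, email, normalize_domain) are hypothetical. Registering a function in MaxCompute itself follows a different procedure.

```python
import sqlite3


def normalize_domain(email):
    """Procedural logic that is awkward in plain SQL: extract a clean domain."""
    if email is None or "@" not in email:
        return None
    return email.rsplit("@", 1)[1].strip().lower()


def domain_counts(emails):
    """Count users per email domain with a declarative query plus a Python function."""
    conn = sqlite3.connect(":memory:")
    try:
        # Expose the Python function to SQL under the same name (one argument).
        conn.create_function("normalize_domain", 1, normalize_domain)
        conn.execute("CREATE TABLE users (email TEXT)")
        conn.executemany(
            "INSERT INTO users (email) VALUES (?)", [(e,) for e in emails]
        )
        # The query states what is wanted; the function hides the how.
        return conn.execute(
            "SELECT normalize_domain(email) AS mail_domain, COUNT(*) AS n "
            "FROM users "
            "WHERE normalize_domain(email) IS NOT NULL "
            "GROUP BY mail_domain "
            "ORDER BY n DESC, mail_domain"
        ).fetchall()
    finally:
        conn.close()


print(domain_counts(["a@Example.com", "b@example.com", "c@test.org", "broken"]))
```

The query stays short and readable, while the string handling lives in ordinary Python where it is easier to write and test.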
Unstructured Data on MaxCompute 2.0
Let’s see how MaxCompute processes unstructured data. On Alibaba Cloud, unstructured data such as files, pictures, and videos is stored in Object Storage Service (OSS), and semi-structured data is stored in Table Store. These are the two most important data sources on Alibaba Cloud. MaxCompute defines external tables, with OSS as the source of unstructured data and Table Store as the source of semi-structured data, so you can choose whether to operate on unstructured or semi-structured data. Because both OSS and Table Store are otherwise accessed through APIs, the external table gives MaxCompute a new way to reduce the cost of operating on such data and to simplify development. External tables therefore effectively extend the computing scenarios and data scope of MaxCompute.
Alibaba always pursues performance, and we optimize it continuously every year. The performance of MaxCompute 2.0 has doubled. When jobs start running at midnight, we do not want them to finish as late as 9:00 the next morning. With the performance of MaxCompute 2.0 doubled, the computing of tier-1, tier-2, and tier-3 tables can be finished by 5:00 or 6:00 in the morning, and colleagues in the business departments are less likely to blame us for delays. Performance is therefore critical to offline jobs. It is just as important when more data has to be scheduled and tens of thousands of jobs queue up to run every night.
MaxCompute Studio 2.8
Because Alibaba develops the computing engines itself, it can bring many new features to MaxCompute Studio. MaxCompute Studio was released very quickly as an integrated development environment (IDE) that is customized for MaxCompute and based on the IntelliJ platform.
Many administrators prefer the efficiency of a command-line terminal for running commands, granting authorization, and managing data. We constantly release new versions of the MaxCompute command line tool, so that different roles can choose the tool that suits them. For example, developers can use MaxCompute Studio, and administrators can use the client for daily management.
As a shared service, Logview impresses users because they no longer have to search through piles of logs for information about a submitted task. After a task is submitted, we send each user a link. They can open the link to view all task-related information and the complete context. All performance problems detected by common tools, including full-disk scanning and data cleansing, are displayed in a directed acyclic graph (DAG), which helps users analyze, locate, and diagnose these problems within the task. Sometimes a large task is submitted and needs to be computed by thousands of instances at the same time. If a single instance runs slowly, it slows down the entire job. In this case, we can use Logview to diagnose, adjust, and optimize the process.
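The point about one slow instance holding back a whole job can be illustrated with a small Python sketch. It does not describe how Logview works internally. It assumes we already have a hypothetical mapping from instance IDs to run times in seconds, and it flags instances that exceed a hypothetical threshold of twice the median run time.

```python
from statistics import median


def job_wall_time(durations):
    """Parallel instances finish together only when the slowest one finishes."""
    return max(durations.values(), default=0.0)


def find_stragglers(durations, factor=2.0):
    """Return (instance_id, seconds) pairs slower than factor times the median.

    Both `durations` and `factor` are illustrative assumptions.
    """
    if not durations:
        return []
    typical = median(durations.values())
    slow = [(i, s) for i, s in durations.items() if s > factor * typical]
    # Slowest first, so the worst offender is at the top of the list.
    return sorted(slow, key=lambda pair: pair[1], reverse=True)


# Hypothetical run times: three instances near 60 s and one long tail.
sample = {"i1": 60.0, "i2": 62.0, "i3": 58.0, "i4": 400.0}
print("job wall time:", job_wall_time(sample))
print("stragglers:", find_stragglers(sample))
```

In this sample, three of the four instances finish in about a minute, yet the job takes as long as the single slow instance, which is why locating that instance matters.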
We have released Python Open Data Processing Service (PyODPS), which you can use to schedule complex computing tasks through very simple development methods. As the Python SDK of MaxCompute, PyODPS can be easily integrated with Pandas DataFrame to achieve high development efficiency.
In addition, we believe that R deserves wider adoption in China, so we support RODPS.
So far, we have seldom used Java database connectivity (JDBC) and have mostly accessed data through the SDK. With the release of MaxCompute JDBC, we can integrate further with third-party databases through JDBC.
Our extract, transform, and load (ETL) tool integration mechanism has supported many open-source projects, including Flume plug-ins, OGG plug-ins, Sqoop, Kettle plug-ins, and Hive Data Transfer user-defined table-generating functions (UDTFs). You can visit the link in the following figure to integrate ETL tools.
Staff members of Internet companies, including those who joined Alibaba at an early stage, are often reluctant to write documentation. However, after communicating with customers, we realized that documentation supports long-term communication in scenarios where customers cannot see the code. As a product department, we have also done a lot of work on documentation. For example, we have established a MaxCompute knowledge base in the Alibaba Cloud Community. We have focused not only on technology, but have also done plenty of work on tools, documentation, and publicity.
Data Security on MaxCompute
Data security is always a top priority because cloud computing depends on security. From a product manager’s perspective, any hint of a security issue must be resolved with top priority; data security is as important as life itself. On April 27, a security vulnerability was found in Hadoop Yarn. Cloud Shield, a very capable security department in Alibaba, provided a lot of security advice in response. Since Alibaba Cloud MaxCompute began to run on the public cloud, no data security incidents have occurred.
Let’s talk about how we ensure security. The following figure shows the architecture of our data center. We need to open a region each time. MaxCompute has a special security mechanism. At the logical layer, MaxCompute provides standard serverless services, which means that users do not provision or manage servers themselves. Within a cluster, tenants are isolated from each other. Projects are then isolated from each other logically. Within each project, we control authorization at a finer granularity, by table and by column. The logical layer is isolated so finely that the model is sufficient on its own, without exposing the file system. But once jobs are running, how do we keep data that is in memory and on the CPU from being taken or stolen by others? In this case, we need to isolate resources. Each user-defined function (UDF) that is called is placed in a separate resource pool and runs in an isolated resource environment to ensure the security of resources and memory. Cloud security is therefore not only isolation among multiple tenants, but comprehensive isolation at the logical level, resource level, and operational level, so that each tenant can share with others in a secure and reliable manner.
In addition, security alone is not enough. Security and sharing always go together, so we use the logical mechanism to achieve both security and secure data exchange and sharing. We aim for a balance rather than absolute security. Data that never flows is like stagnant water.
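The layered authorization described above, where access is narrowed from project to table to column, can be sketched in Python as follows. This is a toy model under stated assumptions, not MaxCompute's actual authorization implementation: the AccessPolicy class, its methods, and all user, project, table, and column names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One read grant; an empty column set means every column of the table."""
    project: str
    table: str
    columns: frozenset


class AccessPolicy:
    """Toy policy: access is denied unless an explicit grant covers it."""

    def __init__(self):
        self._grants = {}  # user name -> list of Grant

    def grant(self, user, project, table, columns=()):
        entry = Grant(project, table, frozenset(columns))
        self._grants.setdefault(user, []).append(entry)

    def can_read(self, user, project, table, column):
        for g in self._grants.get(user, []):
            # Project and table must match before columns are considered.
            if g.project != project or g.table != table:
                continue
            if not g.columns or column in g.columns:
                return True
        return False


policy = AccessPolicy()
policy.grant("alice", "retail", "orders", ["order_id", "amount"])

print(policy.can_read("alice", "retail", "orders", "amount"))          # granted column
print(policy.can_read("alice", "retail", "orders", "customer_phone"))  # not granted
print(policy.can_read("alice", "finance", "orders", "amount"))         # other project
```

The default-deny design means that a column stays private until a grant names it, which is what lets tenants share selected data without exposing the rest.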
Recently, we have cooperated with Forrester Research, which has been doing cloud-based evaluations. Alibaba Cloud MaxCompute and DataWorks have been rated in the top quadrant. Alibaba is honored to be ranked second, right after Amazon Web Services (AWS). Google ranks third and Microsoft fourth, followed by many other tech companies. We have also looked at how Forrester Research evaluates current cloud data warehouses (CDWs). They divide CDWs into three categories: standard multi-tenant CDW, dedicated CDW, and managed CDW. These three categories differ greatly from one another. Standard multi-tenant CDWs include MaxCompute shared clusters and BigQuery serverless services. In a dedicated CDW, a dedicated part is separated from the CDW. All resources in this dedicated part are locked, cannot be shared, and are logically isolated from other resources of the same CDW. In a managed CDW, virtual machines, if regarded as physical resources, are isolated.
In addition, CDWs are also comprehensively evaluated from various aspects, including self-service administration, elastic scale, automatic upgrade, data loading and unloading facility, hybrid cloud, and disaster recovery.
We looked at TPCx-BB, a standard for measuring the computing power of big data, and at the excellent performance of MaxCompute 2.0 in BigBench. We also described the latest developments of MaxCompute in detail to help you fully understand MaxCompute 2.0. With regard to data security, which is of great concern to public cloud users, Alibaba has implemented logical isolation, resource isolation, and operational isolation to guarantee data security and enable secure data exchange and sharing.
The development direction of MaxCompute is still simple: back to basics. We hope that it can have a better ecosystem and be more open, faster, cheaper, simpler, easier to use, and more stable. We also hope that it can run at high speed all day and all night.
To learn more about MaxCompute, visit www.alibabacloud.com/product/maxcompute