Alibaba Cluster Data: Using 270 GB of Open Source Data to Understand Alibaba Data Centers

Jan 14, 2019

Today, many powerful Internet applications run in large-scale data centers, but how much do you actually know about these data centers? Technical documents often describe them with terms such as “large scale” or vaguer ones like “a sea of requests.” Beyond reading such articles, it can be hard to learn how data centers really behave.

How well does each machine run in a data center? What kind of applications run on these machines? What are the characteristics of these applications? While experienced professionals and senior experts in this field may be able to provide clear answers, a typical technology practitioner or enterprise researcher may not be able to do so.

What Is Alibaba Cluster Data?

In 2015, we proposed deploying latency-insensitive offline batch computing tasks and latency-sensitive online services on the same machines in Alibaba data centers, so that resources left idle by online services can be used by offline tasks to improve overall machine utilization. After more than three years of experimentation, architecture adjustment, and resource isolation and optimization, the plan has been put into large-scale production. With this co-location technology, we raised the average cluster resource utilization rate from 10% to 45%. In addition, a variety of optimizations let us run more tasks in our data centers. For example, we reduced the resource cost of every 10,000 transactions during Double 11 events by 17%.
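To make the utilization arithmetic concrete, here is a minimal Python sketch. Every number in it (the hourly load curve, the batch backlog, and the 50% cap) is invented for illustration and is not taken from Alibaba's clusters. It lets batch work fill the capacity that online services leave idle, up to an assumed safety cap, and compares the average utilization before and after.

```python
# Hypothetical hourly CPU utilization of online services, as fractions of
# cluster capacity. All numbers here are invented for illustration.
online_load = [0.05, 0.05, 0.05, 0.10, 0.10, 0.15, 0.20, 0.20]

# Hypothetical batch work waiting to run each hour, in the same units.
batch_demand = [0.60, 0.60, 0.50, 0.40, 0.30, 0.30, 0.20, 0.20]

# Assumed cap on total utilization, leaving headroom for online spikes.
UTILIZATION_CAP = 0.50


def colocate(online, batch, cap):
    """Return per-hour total utilization after batch fills idle capacity.

    Batch work only gets what online services leave unused below the cap,
    so online services keep priority. This ignores interference between
    workloads, which a real scheduler has to manage.
    """
    totals = []
    for online_use, waiting in zip(online, batch):
        idle_below_cap = max(0.0, cap - online_use)
        totals.append(online_use + min(waiting, idle_below_cap))
    return totals


def average(values):
    return sum(values) / len(values)


before = average(online_load)
after = average(colocate(online_load, batch_demand, UTILIZATION_CAP))
print(f"online only: {before:.1%}, co-located: {after:.1%}")
```

The cap is the main design choice in this toy model: a higher cap raises utilization but leaves less room to absorb a sudden rise in online traffic.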

So, what exactly does a cluster look like after these optimizations? How well does co-location technology perform? Beyond publishing articles, releasing the data directly can help bridge the knowledge gap between us and academic researchers and industry experts. We released this dataset to give interested students and researchers a more thorough understanding of large-scale data centers from a data perspective. The dataset contains details about servers running tasks in a production cluster. It offers insight into how we use co-location technology to raise the resource utilization rate to 45%, how many tasks we run every day, and the characteristics of our business resource requirements. How you use this dataset depends entirely on your needs.

What Can You Do with This Copy of Data?

We have just released Alibaba Cluster Data V2018, which contains six files (270+ GB uncompressed; 50 GB compressed) covering eight days of runtime information for 4,000 servers and their corresponding online application containers and offline computing tasks. You can find detailed information on GitHub.

With this copy of data, you can do the following:

Understand the characteristics of the servers and tasks running in advanced contemporary data centers.
Test your various algorithms for managing tasks (such as scheduling and planning) and optimizing clusters, and compose reports.
Learn how to perform data analysis and reveal data patterns that even we did not recognize.

If you have no background in this kind of data, you may still be wondering what you can do with it. Here are several simple example questions:

E-commerce business traffic varies between day and night. How can we improve the overall resource utilization rate as business traffic rises and falls?
How many dependencies does our longest DAG have?
How long does a typical container exist?
How long does a typical computing task run? Do multiple instances of the same task have similar running durations, given that they are theoretically alike?
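As a starting point for the last question, the sketch below groups instance run times by task and reports their mean and spread. The column names (`task_name`, `start_time`, `end_time`) and the sample rows are assumptions made for illustration; consult the data format description that comes with the dataset for the actual schema.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, pstdev

# Tiny inline sample standing in for an instance-level trace file.
# Column names and values are hypothetical, not the real schema.
SAMPLE = """task_name,start_time,end_time
task_a,100,160
task_a,100,165
task_a,105,400
task_b,200,230
task_b,210,240
"""


def durations_by_task(rows):
    """Map each task name to the list of its instances' run times."""
    result = defaultdict(list)
    for row in rows:
        start, end = int(row["start_time"]), int(row["end_time"])
        if end >= start:  # skip records with inconsistent timestamps
            result[row["task_name"]].append(end - start)
    return result


def summarize(durations):
    """Return {task: (mean, population standard deviation)} of run times."""
    return {task: (mean(d), pstdev(d)) for task, d in durations.items()}


stats = summarize(durations_by_task(csv.DictReader(io.StringIO(SAMPLE))))
for task, (avg, spread) in sorted(stats.items()):
    print(f"{task}: mean={avg:.1f} stdev={spread:.1f}")
```

A large spread relative to the mean, as for the made-up `task_a` here, would suggest that instances of one task do not all run for the same time. Because `csv.DictReader` streams rows one at a time, the same functions can be pointed at an open file handle instead of the inline sample without loading the whole file into memory.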

Researchers can also use this data for more advanced analyses.

In 2017, we published our first dataset (Alibaba Cluster Data V2017), which has been used in many excellent academic papers. The following are examples of papers that reference Alibaba Cluster Data V2017, some of which appeared at top venues such as the OSDI symposium. We look forward to seeing what you achieve with this data!

“LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation, Yizhou Shan, Yutong Huang, Yilun Chen, and Yiying Zhang, Purdue University. OSDI’18” (Best paper award!)

“Imbalance in the Cloud: an Analysis on Alibaba Cluster Trace, Chengzhi Lu et al. BIGDATA 2017”

“Characterizing Co-located Datacenter Workloads: An Alibaba Case Study, Yue Cheng, Zheng Chai, Ali Anwar. APSys2018”

“The Elasticity and Plasticity in Semi-Containerized Co-locating Cloud Workload: a View from Alibaba Trace, Qixiao Liu and Zhibin Yu. SoCC2018”

What’s New in Cluster Data V2018

The two most notable differences between V2018 and V2017 are described below.

DAG Information Added

We added DAG information for offline tasks. This is reportedly the largest DAG dataset from an actual production environment.

What Is a DAG?

A DAG (directed acyclic graph) is often used to orchestrate offline computing tasks, such as those in MapReduce, Hadoop, Spark, and Flink. It captures which tasks can run concurrently and which tasks depend on others.

[Figure: an example DAG]
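To illustrate the idea in code, here is a minimal Python sketch of a small job DAG built with the standard library's `graphlib` module (Python 3.9+). The task names and dependencies are invented for illustration and do not come from the dataset. The sketch groups tasks into waves that could run concurrently and counts the tasks on the longest dependency chain.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: each task maps to the set of tasks it depends on.
# Task names are invented for illustration.
dag = {
    "map_1": set(),
    "map_2": set(),
    "join_3": {"map_1", "map_2"},
    "reduce_4": {"join_3"},
    "reduce_5": {"map_2"},
}


def execution_waves(graph):
    """Group tasks into waves; tasks in one wave can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def longest_chain(graph):
    """Count the tasks on the longest dependency chain."""
    depth = {}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(graph).static_order():
        deps = graph.get(task, ())
        depth[task] = 1 + max((depth[d] for d in deps), default=0)
    return max(depth.values(), default=0)


for number, wave in enumerate(execution_waves(dag), start=1):
    print(f"wave {number}: {wave}")
print("longest chain:", longest_chain(dag))
```

The longest chain bounds how quickly the job can finish no matter how many machines are available, which is why DAG structure matters when evaluating scheduling algorithms.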

Larger Scale

V2017 covers about 1,300 servers over a period of roughly 24 hours, while Cluster Data V2018 covers 4,000 servers over 8 days.

Visit http://alibabadeveloper.mikecrm.com/BdJtacN and complete the questionnaire to obtain the download link to the data and data format description.

Reference: https://www.alibabacloud.com/blog/alibaba-cluster-data-using-270-gb-of-open-source-data-to-understand-alibaba-data-centers_594340?spm=a2c41.12498609.0.0