Alibaba Cloud introduced the deep learning tool Arena to the open-source community in July 2018. Now, data scientists can run deep learning on the cloud without having to learn to manipulate low-level IT resources. They can start a deep learning task within a minute, and create a heterogeneous computing cluster within fifteen minutes.
Why Build a Tool Like Arena?
Today, KubeFlow is the most popular deep learning solution within the Kubernetes community, so isn’t Arena just reinventing the wheel? KubeFlow is a combinable, portable, and expandable machine learning technology stack built on Kubernetes. It is an end-to-end solution that supports Jupyter Hub development, TFJob model training to TF-serving, and Seldon prediction. However, KubeFlow requires a mastery of Kubernetes. For example, writing a yaml file to deploy a TFJob is quite challenging for the primary users of a machine learning platform — data scientists.
Such tasks diverge from the expectations of data scientists, who care only about three things:
- Where the data comes from.
- How to run the machine learning code.
- How to examine training results (models and logs).
Data scientists are familiar with and enjoy the work method of writing a few simple scripts and running machine learning code on their desktops. However, the space limitations of their hard drives limit the quantity of data they can process, and their computing power is limited when they have no way to take advantage of distributed training.
This is why we developed Arena. This command line tool shields you from the complexities of low-level resources, environment administration, task scheduling, and GPU scheduling and assignment. Arena helps data scientists submit training tasks and check training progress in the straightforward way with which they are already familiar. When data scientists call Arena, they can designate the data source, code to download, and whether to use TensorBoard to check training results.
What Is The Role of Arena?
Arena currently supports standalone training and PS-Worker model distributed training. On the backend, it relies on the TFJob provided by KubeFlow. Soon, it will be expanded to support MPIJob and PytorchJob also.
It also supports real-time training operations and maintenance including:
- Utilization of the “top” command to monitor the allocation and scheduling of GPU resources.
- CPU and GPU resource monitoring.
- Real-time checking of training logs.
In the future we hope to provide a deep learning production line through Arena that covers the whole process, including integrated training data management, experiment management, model development, continuous training, evaluation, and online prediction.
The goal of Arena is allow data scientists to unleash the power of KubeFlow as easily as training on a desktop, while also giving them control over cluster-level scheduling and administration. We have published our source code on GitHub to better share and cooperate with the open-source community: https://github.com/AliyunContainerService/arena. Everybody is welcome to check it out and use it. If you like it, please star it. We also welcome your contributions to the code.
The Story behind Arena
The open-source tool Arena was born as Alibaba Cloud’s Deep Learning Solution. It already supports many deep learning frameworks (such as TensorFlow, Caffe, Hovorod, and Pytorch), and it supports the whole deep learning production line from start to finish (including the steps of integrated training data management, experiment management, model development, continuous training and evaluation, and online prediction).
This solution deeply integrates the resources and services of Alibaba Cloud. It efficiently utilizes heterogeneous resources like the CPU and GPU, and it centralizes containerization, orchestration, and management, also providing monitoring warnings and a platform for operation and maintenance.
Zhang Kai, a senior technical solution architect at Alibaba Cloud said,
“Deep learning has brought about a revolutionary leap in the development of artificial intelligence, yet it has also sharply increased our reliance on computing and data resources. Alibaba Cloud provides end-to-end support for large-scale training, and we are continuously polishing this deep learning solution to make it easier to use and give it more powerful features.”