Build a Machine Learning System Using Kubernetes

By Che Yang (nicknamed Biran) at Alibaba.

This is part of a series of articles that uses Alibaba Cloud Container Service as an example to help you get started with Kubeflow Pipelines.

Machine learning engineering is complex: it inherits common software development problems and adds the data-driven nature of machine learning. As a result, workflows become longer, data versions get out of control, experiments cannot be easily traced, results cannot be conveniently reproduced, and model iteration is costly. To resolve these inherent issues, many enterprises have built internal platforms to manage the machine learning lifecycle, such as Google's TensorFlow Extended (TFX), Facebook's FBLearner Flow, and Uber's Michelangelo. However, these platforms depend on each company's internal infrastructure, which means they cannot be completely open-sourced. What they share is a design built around the machine learning workflow: a framework that enables data scientists to flexibly define their own machine learning pipelines and reuse existing data-processing and model-training capabilities to better manage the machine learning lifecycle.

Google has extensive experience in building machine learning workflow platforms. Its TensorFlow Extended platform supports Google’s core businesses such as search, translation, and video playback. More importantly, Google has a profound understanding of engineering efficiency in the machine learning field. Google’s Kubeflow team made Kubeflow Pipelines open-source at the end of 2018. Kubeflow Pipelines is designed in the same way as Google’s internal TensorFlow Extended machine learning platform. The only difference is that Kubeflow Pipelines runs on the Kubernetes platform while TensorFlow Extended runs on Borg.

What is Kubeflow Pipelines?

The Kubeflow Pipelines platform consists of the following components:

  • A console for running and tracing experiments.
  • The Argo workflow engine for orchestrating multi-step machine learning workflows.
  • A software development kit (SDK) for defining workflows. Currently, only the Python SDK is supported.

You can use Kubeflow Pipelines to achieve the following goals:

  • End-to-end task orchestration: You can orchestrate and organize a complex machine learning workflow. This workflow can be triggered directly at a scheduled time, or be triggered by events or even by data changes.
  • Easy experiment management: Scientists can try numerous ideas and frameworks and manage various experiments. Kubeflow Pipelines also facilitates the transition from experiments to production.
  • Easy reuse: You can quickly create end-to-end solutions by reusing pipelines and components, without the need to rebuild experiments from scratch each time.

Run Kubeflow Pipelines on Alibaba Cloud

Given these capabilities, you may be eager to try Kubeflow Pipelines. However, to use it in China, you must overcome the following challenges:

  1. Pipelines must be deployed by using Kubeflow. However, Kubeflow includes many built-in components, and deploying it with Ksonnet is complicated.
  2. Pipelines depends heavily on the Google Cloud Platform, so it cannot run out of the box on other cloud platforms or on bare metal servers.

To allow users to install Kubeflow Pipelines in China, the Alibaba Cloud Container Service team provides a Kustomize-based deployment solution. Unlike basic Kubeflow services, Kubeflow Pipelines depends on stateful services such as MySQL and Minio. Therefore, data persistence and backup are required. In this example, we use standard Alibaba Cloud solid-state drives (SSDs) in the data persistence solution to automatically store MySQL and Minio data separately. You can deploy the latest version of Kubeflow Pipelines on Alibaba Cloud offline.

In Linux or Mac OS, run the following commands:
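The commands below are a sketch of the Kustomize installation step. Version 2.0.3 matches the Windows binary mentioned in this article; the GitHub release URL follows the standard layout for that release and should be verified before use.

```shell
# Download the kustomize 2.0.3 binary and place it on the PATH.
# Use opsys=darwin on macOS.
opsys=linux
curl -L -o kustomize \
  "https://github.com/kubernetes-sigs/kustomize/releases/download/v2.0.3/kustomize_2.0.3_${opsys}_amd64"
chmod +x kustomize
sudo mv kustomize /usr/local/bin/
kustomize version
```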

In Windows, you can download and install kustomize_2.0.3_windows_amd64.exe.

  • For more information about creating a Kubernetes cluster in Alibaba Cloud Container Service, see the Alibaba Cloud documentation.

Deployment Procedure

Log on to the Kubernetes cluster through Secure Shell (SSH). For more information about this step, see the Alibaba Cloud documentation.

Download the source code.
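A minimal sketch of this step. The repository name is inferred from the overlay paths used later in this article (`kubeflow-aliyun/overlays/ack-auto-clouddisk`); the exact repository URL is an assumption and should be checked.

```shell
# Clone the deployment repository (URL is an assumption based on the
# overlay paths referenced below) and enter it.
git clone https://github.com/AliyunContainerService/kubeflow-aliyun.git
cd kubeflow-aliyun
```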

Specify security configurations.

Configure a transport layer security (TLS) certificate. If you do not have any TLS certificates, run the following commands to generate one:
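A self-signed certificate can be generated with a standard `openssl` one-liner like the following; the common name (`kubeflow.example.com`) is a placeholder you should replace with your own domain.

```shell
# Generate a self-signed TLS certificate and private key.
# The CN below is a placeholder; substitute your own domain.
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout tls.key -out tls.crt \
  -subj "/CN=kubeflow.example.com"
```

The resulting `tls.key` and `tls.crt` files go into the overlay paths described below.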

If you have a TLS certificate, upload the private key and certificate to kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.key and kubeflow-aliyun/overlays/ack-auto-clouddisk/tls.crt, respectively.

Set a password for the admin account.
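A hypothetical sketch of this step, assuming the deployment reads the admin credentials from a Kubernetes secret; the secret name, namespace, and keys below are assumptions, so check the overlay's documentation for the exact mechanism.

```shell
# Hypothetical: store the admin credentials in a secret that the
# deployment can reference. Name, namespace, and keys are assumptions.
kubectl create secret generic kubeflow-login \
  --namespace kubeflow \
  --from-literal=username=admin \
  --from-literal=password='choose-a-strong-password'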

Use Kustomize to generate a .yaml deployment file.
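A sketch of the build step, using the overlay directory referenced elsewhere in this article and the output path checked in the steps below:

```shell
# Render the ack-auto-clouddisk overlay into a single deployment manifest.
cd kubeflow-aliyun/overlays/ack-auto-clouddisk
kustomize build . > /tmp/ack-auto-clouddisk.yaml
```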

Check the region and zone of the Kubernetes cluster, and replace the zone as required. For example, if your cluster is in the cn-hangzhou-g zone, you can run the following commands:
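The zone swap is a plain `sed` replacement over the generated manifest. The default zone placeholder (`cn-beijing-a`) and the `zoneId` key are assumptions, so inspect the file first; the technique is demonstrated here on a minimal sample, and the same `sed` applies to `/tmp/ack-auto-clouddisk.yaml`.

```shell
# Minimal sample standing in for the generated manifest; the zoneId key
# and the default value cn-beijing-a are assumptions.
cat > /tmp/zone-sample.yaml <<'EOF'
parameters:
  zoneId: cn-beijing-a
EOF
# GNU sed; on macOS use: sed -i '' 's/.../.../g' ...
sed -i 's/cn-beijing-a/cn-hangzhou-g/g' /tmp/zone-sample.yaml
cat /tmp/zone-sample.yaml
```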

We recommend that you check whether the /tmp/ack-auto-clouddisk.yaml file is updated.

Replace the container image addresses with addresses that are accessible from your region.

We recommend that you check whether the /tmp/ack-auto-clouddisk.yaml file is updated.

Adjust the disk space, for example, to 200 GB.
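The size change is another `sed` replacement; the default size (`500Gi`) is an assumption, so check the manifest first. Demonstrated on a minimal PVC-style sample, with the same edit applying to `/tmp/ack-auto-clouddisk.yaml`:

```shell
# Minimal sample; the default of 500Gi is an assumption.
cat > /tmp/size-sample.yaml <<'EOF'
resources:
  requests:
    storage: 500Gi
EOF
# GNU sed; on macOS use: sed -i '' 's/.../.../g' ...
sed -i 's/storage: 500Gi/storage: 200Gi/g' /tmp/size-sample.yaml
cat /tmp/size-sample.yaml
```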

Verify the generated .yaml file of Kubeflow Pipelines.
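On a recent kubectl, the manifest can be validated client-side without touching the cluster:

```shell
# Client-side validation of the rendered manifest.
kubectl apply --dry-run=client -f /tmp/ack-auto-clouddisk.yaml
```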

Deploy the Kubeflow Pipelines service through kubectl.
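The deployment itself is a single apply of the rendered manifest; the `kubeflow` namespace used to watch the rollout is an assumption.

```shell
kubectl apply -f /tmp/ack-auto-clouddisk.yaml
# Watch the pods come up (namespace is an assumption).
kubectl get pods -n kubeflow
```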

Let’s look at how to access the Kubeflow Pipelines service. Here, we use Ingress to expose the service: the URL of the Kubeflow Pipelines console is the HTTPS address of the Ingress IP address.
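The Ingress address can be looked up with kubectl; the `kubeflow` namespace is an assumption.

```shell
# Find the address that exposes the console, then open
# https://<ADDRESS> in a browser.
kubectl get ingress -n kubeflow
```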

Gain access to the Kubeflow Pipelines console.

If you are using a self-signed certificate, the system will notify you that the connection is not private. You can click Advanced to view details and then click Visit to visit the website.

Enter the username admin and the password you specified before.

Now, you can manage and run training tasks in the Kubeflow Pipelines console.

FAQs

1. Why do we use standard Alibaba Cloud SSDs in this example?

Standard Alibaba Cloud SSDs have several advantages. For example, you can schedule periodic snapshots of them to back up the metadata of Kubeflow Pipelines and prevent data loss.

2. How can I back up the data of a disk?

If you want to back up the data stored in a disk, you can manually create snapshots of the disk or apply an automatic snapshot creation policy to the disk to automatically create snapshots on a schedule.

3. How can I undeploy Kubeflow Pipelines?

To undeploy Kubeflow Pipelines, follow these steps:

  • Delete the Kubeflow Pipelines components.
  • Click Release disks to release the two disks that store MySQL and Minio data.
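Assuming the service was deployed from the rendered manifest as described above, the first step can be done by deleting from the same file:

```shell
# Remove everything that was created from the rendered manifest.
kubectl delete -f /tmp/ack-auto-clouddisk.yaml
```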

4. How can I use an existing disk as a database storage instead of automatically creating a disk?

For a detailed answer to this question, see the Alibaba Cloud Container Service documentation.


This document has explained what Kubeflow Pipelines is, the issues it resolves, and how to use Kustomize to quickly deploy Kubeflow Pipelines for machine learning on Alibaba Cloud. In the future, we will share the procedure for developing a complete machine learning workflow based on Kubeflow Pipelines.
