How Is Alibaba Cloud Using Supercomputing to Fight the Epidemic?
In order to win this inevitable battle and fight against COVID-19, we must work together and share our experiences around the world. Join us in the fight against the outbreak through the Global MediXchange for Combating COVID-19 (GMCC) program. Apply now at https://covid-19.alibabacloud.com/
By Lu Xiaoming (Xiaoming)
Recently, you may have come across the terms cloud-based commuting, cloud telemedicine, cloud discos, and cloud gyms; it seems everything has moved to the cloud during the COVID-19 outbreak. These cloud capabilities are made possible thanks to a revolutionary computing method known as High-performance Computing as a Service (HPCaaS).
On March 6, Dr. He Wanqing, the director of High Performance Computing at Alibaba Cloud, participated in an online salon jointly organized by the China Big Data and Intelligent Computing Industry Alliance and AIPharos Moonlight Society. This event was titled “Challenges and Opportunities for Big Data and Intelligent Computing during the Epidemic”. Dr. He gave a presentation entitled “Alibaba Cloud Supercomputing and AI Unleash the Potential of New Drug R&D”. As one of the major contributors to the establishment of the world’s first cloud-based global supercomputing center, Dr. He discussed physical machine performance and virtualization implementations in cloud supercomputing. He showed how this technology provides the computing power needed for the R&D and screening of drugs to fight the novel coronavirus. The high-performance computing (HPC) product offered by Alibaba Cloud is a Platform-as-a-Service (PaaS) product designed for the benefit of all. Alibaba Cloud believes that this new infrastructure is akin to providing centralized heating so every house does not have to tend its own fire.
The following content was presented by Dr. He Wanqing.
The Urgent Need for Powerful Computing Capabilities
Due to the outbreak of the novel coronavirus, to carry out drug research, candidate screening, and other work, many universities and other institutions need to quickly obtain the required computing resources. Therefore, during the Chinese New Year holiday, Alibaba Cloud began to provide computing support for new drug R&D free of charge.
Currently, more than 10 organizations, including many leading public research teams, such as the Global Health Drug Discovery Institute (GHDDI), have used Alibaba Cloud’s services.
Our technologies have been applied in three major fields: drug development and screening, rapid lung CT diagnosis through machine learning and statistical methods, and infection rate tracking and prediction for rapid disease control.
Why Cloud Supercomputing Is Needed to Fight the Epidemic
In the research on the novel coronavirus, there is a major class of applications called non-invasive detection standards. This primarily refers to nucleic acid detection, but this process is slow. The media has reported that it still takes up to 3 hours to extract nucleic acid samples when medical personnel must wear protective equipment.
Based on a large amount of clinical data, patient lung CTs, and other information, we can use big data and machine learning to confirm diagnoses. However, this approach requires lots of computing power.
In the computing industry that was created to calculate projectile trajectories, high-performance computing (HPC) has been a goal from the very first. In recent years, the introduction of GPUs has resulted in great strides in computing power. Many jobs that took days or even months in the past can now be completed shortly through GPU acceleration.
Limitations of Existing Supercomputers
Supercomputing centers provide vast amounts of computing power, but they also suffer from the same isolation problem that we have had to face during the outbreak. Making configuration changes and upgrades is easy in the cloud with cloud-based supercomputing features. This is because enterprises such as Alibaba, Amazon, and Microsoft that serve millions of companies cannot afford any interruption.
At the same time, new drug screening is a concurrent process. HPC is used in a large number of applications, such as molecular dynamics (MD), computational chemistry, and gene assembly. Like these applications, drug simulation preparation and virtual screening require high-throughput parallel computing.
About two years ago, although China was home to the world’s two most powerful supercomputers (Sunway TaihuLight and Tianhe-2A), supercomputing was still a field open only to a select few. In the early years of the industry, some people proposed HPC as a Service, but we first had to develop computing, storage, and network technologies that met the standards of supercomputing.
Supercomputer Cluster based on the X-Dragon Architecture
To make the supercomputing field more accessible, Alibaba Cloud developed a supercomputer cluster based on the X-Dragon architecture in 2017. By building low-latency and high-bandwidth network clusters, we set up virtualized and scalable nodes. Together with graphic nodes, logon nodes, and storage products, these components constitute a platform.
Currently, for industrial simulation, chip designers, and other big computing products, Alibaba Cloud provides services through high-performance supercomputer clusters.
As long as they have Alibaba Cloud accounts, users can log on to enjoy the high-performance computing resources provided by Elastic High Performance Computing (E-HPC) and share their drug research results on the GHDDI platform supported by E-HPC.
Another advantage of supercomputer clusters is that, as long as cloud computing zones are widely distributed around the globe, drug researchers can quickly access drug databases provided anywhere in the world.
Researchers can perform drug screening through convolutional neural networks or knowledge graphs based on Alibaba Cloud E-HPC. This significantly accelerates the drug discovery process, and machine learning can be used for further acceleration. Thanks to this sort of AI computing power, Alibaba Cloud provided the support that allowed GHDDI to publish drug screening data for the eight biological targets of COVID-19 on its public platform.
HPC in the Cloud
HPC applications can be roughly divided into three fields.
The first field is public science, including scientific research. The second field is industry computing, including car collisions, car design, semiconductors, and chips. These applications require continuous computing on a large scale. The third field is universities and scientific research institutions, where the required computing time and scale vary according to the institution’s needs.
Industrial computing requires long-term stability, which means massive computing resources must be available at any time. It is difficult to use user-built or rented supercomputing centers because they do not provide the required elasticity, stability, and availability. However, these are the most important features of cloud computing. The largest investment of cloud vendors is in stability and elasticity, as well as the personnel and technologies required for the corresponding O&M. This is also one of the most significant strengths of cloud vendors.
Then, what computing scales can be provided by cloud solutions?
In fact, extremely large-scale computing is not suitable for running in the cloud. Computing of this type is possible in the cloud, but this is not economical.
However, if you need computing power for an academic paper or to compile a list, most of the resources will be idle after the project is done. These applications are perfect for the cloud because they take advantage of another feature of cloud computing, that is, shareability. With this feature, computing resources can be shared among a large number of users.
HPC vs. AI
In addition, the problems solved by HPC are quite different from the problems that artificial intelligence (AI) can solve.
To solve the behavior equation of the target object, HPC splits the equation into individual nodes for calculation and synchronizes them to each time node for synchronous communication before moving to the next step. Therefore, if one node becomes slow, all the other nodes have to wait. This means that HPC poses high requirements on computing, storage, and network and has low tolerance for latency. This computation method is mainly used in applications with a large number of mesh nodes, such as meteorology, large-scale MD simulation, and car collisions.
Although some applications do not have demanding latency requirements, they do involve many I/O operations or high-speed access to large memory, such as rendering and electronic circuit design. For Alibaba Cloud, all these operations can be implemented on it.
HPC for Life Sciences
In the life science field, many applications require massive message passing interface (MPI) communication. Due to the highly intensive computing required, many applications have developed different GPU versions. Currently, GROMACS and LAMMPS are two of the most commonly used software in this field. These products both perform MD simulation. From a molecular point of view, they see you as your motion, including your interactions, which are viewed as different types of forces. They simulate a group through the behavior of a single molecule, so the group can be in gas state, liquid state, or another form. This large-scale computing is most commonly seen in the preparation of various drugs.
Transforming HPC to a PaaS Product
In the past, cloud computing was not favored by most users who needed HPC. Theoretically, in a single-node experiment, a physical machine would provide better performance than a virtual machine (VM). The unwritten rule of the HPC field was, when purchasing a machine, always buy the best and fastest available.
However, Alibaba Cloud’s X-Dragon servers enable virtualization for large computing systems, especially for HPC systems. This completely eliminated the performance loss due to virtualization, a dreaded scourge in the HPC field.
The X-Dragon servers, which came out in 2017, implemented virtualization through our proprietary chips and MOC cards and integrated storage network management. This way, no CPUs were wasted, and 100% of CPU resources could go to the user. Although its advantages are obvious, in theory, it still consumes some additional resources.
When containers are used, as the number of nodes increases, this waste will quickly grow. That is, the higher the performance, the more it tends to the inflection point. However, by using the X-Dragon architecture, the performance of the cluster will get better and better because MOC cards are distributed to each server and provide a synergistic effect.
Remote Direct Memory Access (RDMA)
After the single-node issue is resolved, HPC additionally requires a high-speed network. Let’s consider whether the control of communication between nodes can be satisfactorily implemented based on a VPC instance, just like in the public cloud.
In this case, RDMA-based computation is required to adapt to other computing models. As long as the remote direct memory access (RDMA) connection is used, we can use virtualization and MOC cards for external interconnection to obtain the performance of a physical machine and the interface of a VM. After you build the parallel file system on this architecture, everything is ready. This approach retains the most important features of cloud computing, which are computing, storage, and network separation, and enables integration with different products.
Cloud vs. Traditional Supercomputing Cluster
Typically, to build a supercomputing center or a supercomputing cluster, you need to purchase computing nodes, file storage for the disk array, and PC servers for hardware preparation.
In the cloud, however, all you need is just production. A supercomputer cluster can produce nodes, which means generating VM instances. Although these are VM instances, the functions of the physical machines can be obtained through X-Dragon servers. This system can also produce file storage, network-attached storage (NAS), and parallel file systems, including nodes.
These production capabilities support elastic scaling, which is also the most powerful capability of cloud computing and cloud-based HPC. With auto scaling, you can quickly produce and deploy new clusters according to predefined strategies.
At present, we can perform auto scaling within 90 seconds and copy all installed software.
When using HPC as a service in the cloud, you have several options for uploading files. You can cache files on local disks and then store them in the shared file system. Alternatively, you can do away with this process and perform computing, including visual computing, at an upper layer.
Imagine that all your computing tasks could be completed with a click of the mouse in the cloud. This means you would not have to spend a lot of time learning how to build an HPC system.
This is also a difference between the applications in China and other countries. The reason why many foreign teachers and users prefer cloud supercomputing is that they do not need to learn about HPC or become IT experts. They simply need to know about their applications.
Alibaba Cloud HPC
From this perspective, Alibaba Cloud HPC is a PaaS product. You can use APIs to select any type of compute nodes. Therefore, Alibaba Cloud can now support both supercomputer clusters and elastic computing by using multiple ECS instances.
Elastic Computing in the Cloud: Balancing Costs with Performance
Auto scaling is the area where cloud computing truly shines. You can simply set the node you want and the duration of the computing process or what commands you want to run, and we will resize your resources as needed. We provide many resizing parameters, so you can specify the scaling interval or the scaling ratio.
By using auto scaling with the preemption capability, you will obtain amazing results. By sacrificing 10% of your performance, you can reduce costs by 90%.
This is accomplished through preemptible instances. These instances allow you to claim resources, but under the condition that you can only use the resources for 1 or 3 hours. If your computation fails to finish at the end of this time, the check point function is used to save your progress before the resources are released, so that your project can be resumed later.
Elastic Scaling of HPC Resources
Genetic computing is not a daily activity, but an occasional task. This means its HPC resources must be repeatedly scaled. When the task volume increases, new machines are automatically generated. After the computation task is completed, the system releases the machines when it detects that the computing load has fallen. This solution uses backend technology to supply computing resources just like the use of electricity.
In addition, elastic computing capabilities can reach across regions, zones, and even heterogeneous nodes. For example, genetic screening involves concurrent computing, but it does not matter whether or not the configurations of two nodes are the same. If the scheduling is rational enough, heterogeneous nodes can be used for different operations, giving you even greater flexibility.
For a long time, there has been something of an arms race in the supercomputing field. Many engineers in the industry provide high-performance services, but they can only serve a single user or application at a time. Although this approach may seem really cool, it is by changing the production method or computing method that we can bring the benefits of HPC to all.
Cloud computing allows just this sort of change. The real technological breakthrough can be found in the scheduling performed by X-Dragon in the cloud. This technology helps ordinary people to work or learn from home, making it a truly inclusive breakthrough.
During the current epidemic, Alibaba has provided supercomputing capabilities to GHDDI and Tsinghua University, among others. You can learn about their research here: Alibaba Cloud HPC Empower Computational-Driven-Drug-Discovery for Fighting COVID-19
While continuing to wage war against the worldwide outbreak, Alibaba Cloud will play its part and will do all it can to help others in their battles with the coronavirus. Learn how we can support your business continuity at https://www.alibabacloud.com/campaign/fight-coronavirus-covid-19