Building a High Performance Container Solution with Super Computing Cluster and Singularity

High Performance Container: Singularity

Singularity is a container technology developed at Lawrence Berkeley National Laboratory specifically for large-scale, cross-node HPC and DL workloads. It is lightweight, quick to deploy, and easy to migrate, and it can convert Docker images into Singularity images. Singularity differs from Docker in the following aspects:

User Permissions

Singularity containers can be started by both root and non-root users. The user context is the same before and after the container starts, so a user has the same permissions inside the container as outside it.
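The unchanged user context can be checked with a short script such as the one below. This is only a sketch: the image path is borrowed from Case 1 and is an assumption about your environment, and the container part is skipped when Singularity or the image is not present.

```shell
#!/bin/sh
# Assumed image path (taken from Case 1); override it with SIF_FILE=... if needed.
SIF_FILE="${SIF_FILE:-/opt/centos7-intelmpi-namd.sif}"

# User name and numeric UID as seen on the host.
host_user="$(id -un) (uid $(id -u))"
echo "host:      $host_user"

# Run the same query inside the container, if Singularity and the image exist.
if command -v singularity >/dev/null 2>&1 && [ -f "$SIF_FILE" ]; then
    container_user=$(singularity exec "$SIF_FILE" sh -c 'echo "$(id -un) (uid $(id -u))"')
    echo "container: $container_user"
    if [ "$host_user" = "$container_user" ]; then
        echo "same user context inside and outside the container"
    fi
fi
```

If the two lines differ, the container was probably started with options that change the user mapping, which is outside the default behavior described here.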

Performance and Isolation

Singularity emphasizes convenience, portability, and scalability, and deliberately relaxes the strict isolation of container processes. As a result, it is more lightweight, uses fewer kernel namespaces, and incurs less performance loss.

HPC-Optimized

Singularity is well suited to HPC scenarios. It makes full use of host software and hardware resources, including HPC schedulers (PBS and Slurm), cross-node communication libraries (Intel MPI and Open MPI), network interconnects (Ethernet and InfiniBand), file systems, and accelerators (GPUs). Users can run Singularity without any extra adaptation to the HPC environment.
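Returning to the Docker image conversion mentioned at the start of this section, a minimal sketch could look like the following. The image reference comes from Case 1; the output file name centos7.sif and the RUN_BUILD switch are assumptions made for illustration. The script prints the command and only runs it when RUN_BUILD=1 is set and Singularity is installed.

```shell
#!/bin/sh
# Docker image reference from Case 1; the output file name is an assumption.
DOCKER_IMAGE="centos:7.2.1511"
SIF_FILE="centos7.sif"

# "singularity build <target> docker://<image>" converts a Docker image to SIF.
BUILD_CMD="singularity build $SIF_FILE docker://$DOCKER_IMAGE"
echo "$BUILD_CMD"

# Hypothetical opt-in switch: run the build only when explicitly requested.
if [ "${RUN_BUILD:-0}" = "1" ] && command -v singularity >/dev/null 2>&1; then
    $BUILD_CMD
fi
```

The resulting image would still need Intel MPI, NAMD, and the input files added before it matches the image used in Case 1.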

E-HPC Elastic High Performance Container Solution

Alibaba Cloud E-HPC integrates the open source Singularity container technology. While supporting the rapid deployment and flexible migration of user software environments, E-HPC also ensures the high availability of on-cloud HPC services and compatibility with existing E-HPC components, delivering an efficient and easy-to-use elastic and high performance container solution to users.

Singularity Deployment Cases

Case 1: Run the NAMD Container Job on Multiple SCC Nodes

NAMD is a mainstream molecular dynamics (MD) simulation package with good scalability and high parallel efficiency. It is often used to simulate large-scale molecular systems. In the following description, we assume that a Singularity image containing Intel MPI, NAMD, and the input files has been created from the image docker.io/centos:7.2.1511. The PBS scheduler is used to submit the NAMD container job and a local job sequentially to four SCC nodes (ecs.scch5.16xlarge, Intel Xeon (Skylake) Gold 6149, 3.1 GHz, 32 physical cores, and 192 GB of memory). The PBS job script is as follows:

#!/bin/sh
#PBS -l ncpus=32,mem=64gb
#PBS -l walltime=20:20:00
#PBS -o namd_local_pbs.log
#PBS -j oe
# Run the job in the Singularity container
/opt/intel/impi/2018.3.222/bin64/mpirun --machinefile machinefile -np 128 singularity exec --bind /usr --bind /sys --bind /etc /opt/centos7-intelmpi-namd.sif /namd-cpu/namd2 /opt/apoa1/apoa1.namd
# Run the job on the local host
/opt/intel/impi/2018.3.222/bin64/mpirun --machinefile machinefile -np 128 /opt/NAMD_2.12_Linux-x86_64-MPI/namd2 apoa1/apoa1.namd
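The job script above reads a file named machinefile, but the article does not show how it is created. One possible way, sketched below, is to derive it from the node list that PBS exposes through PBS_NODEFILE. The helper name make_machinefile is hypothetical, and the 32 slots per node are an assumption based on the 32 physical cores of each SCC node.

```shell
#!/bin/sh
# Hypothetical helper: convert a PBS node file (which may repeat host names)
# into an Intel MPI machinefile with one "host:slots" entry per unique host.
# $1 = path to the node file, $2 = number of MPI ranks per host
make_machinefile() {
    nodefile=$1
    slots=$2
    sort -u "$nodefile" | while read -r host; do
        if [ -n "$host" ]; then
            printf '%s:%s\n' "$host" "$slots"
        fi
    done
}

# Inside a PBS job, write the machinefile used by the mpirun commands.
if [ -n "${PBS_NODEFILE:-}" ] && [ -f "$PBS_NODEFILE" ]; then
    make_machinefile "$PBS_NODEFILE" 32 > machinefile
fi
```

With four nodes and 32 slots each, the file describes the 128 ranks requested by -np 128.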

Case 2: Run the TensorFlow Image Classification Container Job on an EGS Instance

CIFAR-10 is a classic dataset in the image recognition field. In the following description, we assume that a Singularity container and a Docker container, each containing the image classification model, have been created from the image docker.io/tensorflow/tensorflow:latest-devel-gpu-py3. Training is then carried out with each container on a single EGS node (ecs.gn5-c8g1.4xlarge, Intel Xeon E5-2682 v4, 2.5 GHz, 16 vCPUs, 120 GB of memory, and two P100 GPUs). The command lines are as follows:

# Run the job in the Singularity container
singularity exec --nv /opt/cifar10.sif python /cifar10/models/tutorials/image/cifar10/cifar10_multi_gpu_train.py --num_gpus=2
# Run the job in the Docker container
nvidia-docker run -it d6c139d2fdbf python /cifar10/models/tutorials/image/cifar10/cifar10_multi_gpu_train.py --num_gpus=2
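Before starting a long training run, it can be useful to confirm that both GPUs are visible inside the Singularity container. The sketch below assumes that nvidia-smi is reachable in the container through the --nv option; the helper name count_gpus is hypothetical, and the check is skipped when Singularity or the image is missing.

```shell
#!/bin/sh
# Hypothetical helper: count the "GPU <n>: ..." lines printed by "nvidia-smi -L".
count_gpus() {
    grep -c '^GPU [0-9]'
}

# Image path from the Singularity command in this case.
SIF_FILE="/opt/cifar10.sif"

if command -v singularity >/dev/null 2>&1 && [ -f "$SIF_FILE" ]; then
    # --nv exposes the host NVIDIA driver and devices to the container.
    visible=$(singularity exec --nv "$SIF_FILE" nvidia-smi -L | count_gpus)
    echo "GPUs visible in container: $visible"
    if [ "$visible" -lt 2 ]; then
        echo "warning: fewer than 2 GPUs visible, --num_gpus=2 may fail" >&2
    fi
fi
```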

Conclusion

Alibaba Cloud Super Computing Cluster integrates the open source Singularity container technology to deliver an efficient and easy-to-use on-cloud elastic high performance container solution. This solution greatly reduces users’ cloud migration costs and improves their scientific research efficiency.

Alibaba Cloud

Follow me to keep abreast with the latest technology news, industry insights, and developer trends. Alibaba Cloud website:https://www.alibabacloud.com