How Alibaba’s Bringing AI to More Places

By Ling Chong.

Artificial intelligence on the Internet has already had a huge impact on our lives, dramatically changing how web and mobile applications work and how powerful they can become. Looking ahead, AI is also increasingly moving offline, beyond purely Internet- and mobile-based technology.

As a forerunner in AI solutions, Alibaba has been helping to take AI offline and bring it to real-world applications in several key industries, including retail, transportation, and banking. For these industries, as well as many others, the limitations of cloud-based AI applications, such as latency, cost, and security, make them unfit to meet even the most basic business requirements. However, with AI taken offline, and with 5G on the way, these industries can begin to take full advantage of this powerful technology.

In fact, offline AI applications have now become an integral part of the comprehensive intelligence solutions that Alibaba Group provides in a number of key industries. This blog looks at the technology behind Alibaba’s AI solutions in detail, with offline solutions as the main focus.

What’s Alibaba Been Doing

The offline intelligence team of Alibaba’s DAMO Academy and Machine Intelligence Lab began working on offline AI at the end of 2016, covering algorithms, engineering, productization, and business implementation. Since then, the team has made some major breakthroughs together with Alibaba’s partners. In terms of algorithms, we have proposed a self-developed model compression method, a new model structure, and an object detection framework. In terms of engineering, we have developed a set of data-independent quantization training tools, as well as an efficient inference computing library for different hardware platforms. We have also worked with the server R&D team to abstract a combined software and hardware product solution, and deployed this solution in real business scenarios through a range of services.

In the following sections, we’ll summarize and share our previous work in terms of our exploration into various algorithms, training tools, inference frameworks, productization, and business models.


Low-Bit Quantization Based on Alternating Direction Method of Multipliers (ADMM)

Low-bit quantization is a core problem in model compression and inference acceleration. It aims to quantize the original floating-point parameters of a neural network into 1–8 bits to reduce the model size and computing resource consumption. To solve this problem, we proposed a low-bit quantization solution based on the Alternating Direction Method of Multipliers (ADMM). On the public ImageNet dataset, we conducted experiments on classical CNN structures such as AlexNet, ResNet-18, and ResNet-50. In these experiments, both accuracy and speed exceeded those of known algorithms, and we achieved almost lossless compression with 3-bit quantization. At present, this method is widely used in real-world on-terminal object detection and image recognition projects. The findings were published at AAAI 2018.
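To make the idea more concrete, here is a minimal Python sketch of ADMM-style quantization. It is an illustration under simplifying assumptions, not the implementation from the AAAI 2018 paper: the loss is a toy elementwise quadratic that pulls each weight toward a given float value, the quantization grid is assumed to be uniform and symmetric with a fixed `scale`, and the names `project_to_grid` and `admm_quantize` are hypothetical.

```python
def project_to_grid(x, scale, bits):
    """Round x to the nearest value in {-q, ..., q} * scale, q = 2**(bits-1) - 1."""
    q = 2 ** (bits - 1) - 1
    level = max(-q, min(q, round(x / scale)))
    return level * scale


def admm_quantize(target, scale, bits, rho=1.0, steps=50):
    """Quantize `target` with ADMM on the toy loss 0.5 * sum((w - t)**2).

    w: continuous weights, g: quantized copy constrained to the grid,
    u: scaled dual variable that accumulates the gap between w and g.
    """
    w = list(target)
    g = [project_to_grid(x, scale, bits) for x in w]
    u = [0.0] * len(w)
    for _ in range(steps):
        # W-step: closed-form minimizer of
        # 0.5 * (w - t)**2 + (rho / 2) * (w - g + u)**2
        w = [(t + rho * (gi - ui)) / (1.0 + rho)
             for t, gi, ui in zip(target, g, u)]
        # G-step: project w + u onto the discrete grid
        g = [project_to_grid(wi + ui, scale, bits) for wi, ui in zip(w, u)]
        # Dual step: accumulate the remaining constraint violation
        u = [ui + wi - gi for ui, wi, gi in zip(u, w, g)]
    return g


if __name__ == "__main__":
    float_weights = [0.30, -0.62, 1.00, 0.02]
    print(admm_quantize(float_weights, scale=0.25, bits=3))
```

In a real network the W-step would be a few steps of ordinary gradient-based training on the task loss plus the quadratic penalty, rather than a closed-form update; the projection and dual steps keep the same shape.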

Unified Quantization and Pruning Framework

Quantization accelerates inference by simplifying the computing unit, replacing floating-point units with fixed-point units. Pruning reduces the actual computing workload by removing paths in the neural network. Naturally, we integrate both technologies to approach the best theoretical acceleration ratio. In the pruning process, we adopt a progressive training method and use gradient information to determine the importance of each path in the network. With the ResNet structure, we can achieve nearly lossless compression at a sparsity of 90%.
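The following Python sketch illustrates one way gradient-aware progressive pruning can work. It is an assumption-laden example, not the team's method: the importance score `|weight * gradient|` is one common choice, the linear sparsity schedule is arbitrary, and the function names are hypothetical.

```python
def importance(weights, grads):
    """Score each weight by |w * dL/dw| (an assumed importance measure)."""
    return [abs(w * g) for w, g in zip(weights, grads)]


def prune_to_sparsity(weights, grads, sparsity):
    """Zero out the `sparsity` fraction of weights with the lowest scores."""
    n_prune = int(len(weights) * sparsity)
    scores = importance(weights, grads)
    order = sorted(range(len(weights)), key=lambda i: scores[i])
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def sparsity_schedule(final_sparsity, steps):
    """Linearly ramp the target sparsity up to `final_sparsity`."""
    return [final_sparsity * (s + 1) / steps for s in range(steps)]


if __name__ == "__main__":
    weights = [0.5, -0.1, 0.3, 0.05, -0.8, 0.2, 0.9, -0.4, 0.6, 0.01]
    grads = [1.0] * len(weights)  # placeholder gradients for the demo
    for target in sparsity_schedule(0.5, steps=5):
        # In real training, fine-tune and recompute gradients between steps.
        weights = prune_to_sparsity(weights, grads, target)
    print(weights)
```

Weights that are already zero have a score of zero, so each step keeps earlier pruning decisions and only removes additional paths.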

While researching pruning, we found that finer-grained pruning often yields higher accuracy, but at the cost of hardware friendliness. As a result, the theoretical acceleration ratio is difficult to reach in practice. In the following sections, we try to solve this problem from two perspectives:

  • Hardware/software co-design to solve problems simultaneously from the perspectives of software and hardware.
  • New lightweight network to be designed from the software perspective to better adapt to the existing hardware structure.

Network Structure Adopting the Hardware/Software Co-Design

With quantization and pruning, we can obtain a deep network model with a sufficiently low computing workload and sufficiently simple computing units. The next problem is how to turn it into an algorithm service whose measured inference latency is actually low. To push inference acceleration to its limit, we worked with the server R&D team on hardware/software co-design. In this project, we propose the following innovations:

  • In terms of hardware/software co-design, we propose a heterogeneous multiple branch structure for hardware physical characteristics to maximize parallel efficiency.
  • In terms of algorithms, we use technologies such as quantization, pruning, and knowledge distillation to reduce the theoretical computation workload to 18% of the original model.
  • In terms of hardware, we use the operator filling technology to solve the bandwidth problem brought about by pruning computing, and use the operator shuffling technology to balance the PE load.

With the preceding solution, the latency for completing the model inference of ResNet-18 complexity is as low as 0.174 ms, achieving the best score in the industry. This solution provides significant advantages for latency-sensitive fields. Relevant findings have been demonstrated at HotChips 30.

New Lightweight Network

While hardware/software co-design is an outstanding inference solution, the development and hardware costs of the solution are high. Some specific scenarios have high tolerance for latency and accuracy, for example, face capturing. To meet this requirement, we propose a Multi-Layer Feature Federation Network (MuffNet), which has three features:

  • Sparse topology, which makes it easier to obtain high-frequency responses;
  • Dense computing nodes, which ensure hardware friendliness;
  • Full optimization of low-cost hardware, which improves accuracy when the computing workload is low.

The new network is well suited to general-purpose hardware because each unit is computationally dense and there are no excessively fragmented operations. On the public ImageNet dataset, at a computing budget of 40 MFLOPs, its absolute accuracy is 2% higher than that of ShuffleNetV2, which is known as the best structure in the industry.
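For readers unfamiliar with how such computing budgets are counted, the sketch below estimates the multiply-accumulate operations (MACs) of a convolution layer, which is what figures of this kind commonly measure. The helper `conv_macs` and the layer shapes are hypothetical and are not taken from MuffNet.

```python
def conv_macs(kernel, c_in, c_out, h_out, w_out, groups=1):
    """Multiply-accumulates for one 2-D convolution layer (bias ignored)."""
    return kernel * kernel * (c_in // groups) * c_out * h_out * w_out


if __name__ == "__main__":
    # A standard 3x3 convolution, 16 -> 32 channels, on a 56x56 output map.
    standard = conv_macs(3, 16, 32, 56, 56)
    # The same mapping as a depthwise 3x3 followed by a pointwise 1x1.
    depthwise = conv_macs(3, 16, 16, 56, 56, groups=16)
    pointwise = conv_macs(1, 16, 32, 56, 56)
    print(standard, depthwise + pointwise)
```

Counts like these explain why lightweight designs favor factorized convolutions, and also why they can fragment into many small operations that general hardware runs inefficiently.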

On-Terminal Object Detection Framework

Compared with image recognition tasks, object detection tasks are more widely used, so an efficient object detection framework has high research value. For the on-terminal scenario, we propose a light refine single shot multibox detector (LRSSD) framework, which has the following features:

  • Simplifies the SSD head and adopts the shared prediction layer to design the fused;
  • Integrates the bounding box (BB) regression in the information cascade form at different scales;
  • Performs full quantization of the detection model.

As shown in the preceding table, with the same backbone network, the proposed LRSSD consistently increases the mAP by 3% to 4% while reducing the computing workload of the SSD head. From another perspective, our method can reduce the complexity of the model by about 50% without changing the detection accuracy. Taking into account the speedup brought by quantization, at the same accuracy we obtain a real speedup of about 2–3 times over the original full-precision model.


The preceding sections describe some of the technical results in model compression that we have accumulated in the offline intelligence field over the last two years. They can be summarized as follows:

  • Quantization: We can achieve almost lossless compression in 3-bit quantization.
  • Pruning: For traditional network structures, we can achieve nearly lossless compression at the sparsity of 90%.
  • Hardware/software co-design: We work with the server R&D team to achieve the ResNet-18 ultimate inference speed of 0.174 ms per image, which is currently the best score in the industry.
  • Lightweight network design: At a computing budget of 40 MFLOPs, we have increased the absolute accuracy on the ImageNet dataset by 2% compared with the best structure in the industry.
  • On-terminal object detection: We can increase the speed by about 2–3 times without changing the accuracy.

While exploring technologies, we are also actively applying these technologies to actual services. In this process, we found the following problems:

  • Ease-of-use: Business scenarios often require fast iteration and flexible and convenient deployment capabilities. Therefore, it is difficult to widely apply non-standardized solutions.
  • Theoretical speed versus actual speed: In addition to algorithms and hardware, the actual model inference speed requires the support of efficient engineering implementation.
  • Integration: Offline intelligence needs to test the team’s strength in both hardware and software, which is often too much for the services.

In the second half of this document, we first introduce the attempts we have made to address the preceding issues and the solutions that we eventually came up with. Finally, we list some examples of how offline intelligence technology is applied in specific business scenarios to help you better understand the technology.

Training Tools

In the actual service promotion process, the first problem is ease-of-use:

  • Different services often use different deep learning libraries, such as Caffe, TensorFlow, and MXNet.
  • The basic technologies used by different services vary dramatically. These basic technologies include classification and recognition, detection, segmentation, and speech.
  • The data security levels of different services vary dramatically. Some of them can be made public, while others require fully physical isolation.

To allow more scenarios to use our services and reap the benefits of AI, we have proposed a set of standardized quantization training tools.

Firstly, as shown in the preceding figure, our tool supports inputs in multiple model formats, such as TensorFlow, Caffe, and MXNet. Secondly, we provide two different model quantization methods. One is the data-dependent compression method that supports different tasks, such as classification, detection, and segmentation. This method is applicable to services that are less demanding on data security but demanding on accuracy. The other is the data-independent compression method. This method is applicable to scenarios where the data security requirements are high, but the service logic is less complicated.

Finally, after quantization is completed, our tool automatically optimizes the inference graph, encrypts the model, and generates a model file that is ready to deploy. Together with the corresponding inference acceleration library, the model can run on terminals. From the perspective of ease-of-use and data security, we recommend that you use the data-independent compression method.
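The difference between the two quantization modes described above can be sketched in a few lines of Python. The article does not describe how the tools work internally, so everything here is an assumption for illustration: the data-independent path derives a scale from the weights alone, while the data-dependent path calibrates an activation scale from sample inputs. All function names are hypothetical.

```python
def symmetric_scale(max_abs, bits=8):
    """Scale that maps [-max_abs, max_abs] onto the signed integer range."""
    q_max = 2 ** (bits - 1) - 1
    return max_abs / q_max if max_abs > 0 else 1.0


def quantize(values, scale, bits=8):
    """Round values to signed integers, clamping to the representable range."""
    q_max = 2 ** (bits - 1) - 1
    return [max(-q_max, min(q_max, round(v / scale))) for v in values]


def weight_scale_data_independent(weights, bits=8):
    """Needs only the trained weights, so no business data leaves its owner."""
    return symmetric_scale(max(abs(w) for w in weights), bits)


def activation_scale_data_dependent(calibration_batches, bits=8):
    """Needs representative inputs to observe the real activation range."""
    observed = max(abs(a) for batch in calibration_batches for a in batch)
    return symmetric_scale(observed, bits)


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.3]
    w_scale = weight_scale_data_independent(weights)
    print(quantize(weights, w_scale))
    a_scale = activation_scale_data_dependent([[0.1, 2.0], [-4.0, 1.0]])
    print(a_scale)
```

The trade-off mirrors the one in the text: calibration data can improve accuracy, while a weights-only procedure is simpler to run when data must stay isolated.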

Currently, this tool is widely used in multiple offline service scenarios within Alibaba Group as a quantization tool recommended by MNN.

Inference Frameworks

The second problem in practice is the actual inference speed. After all, ease-of-use alone is not enough; what services care about most is real-world performance. Here, we use the inference frameworks provided by other teams of Alibaba Group:

  • ARM architecture: We use the MNN developed by the Taobao technical team as the inference framework.
  • GPU architecture: We use the falcon_conv convolution library developed by the machine intelligence technical team as the inference framework.
  • FPGA architecture: We use the inference framework developed by the server R&D team.

Mobile Neural Network (MNN)

Mobile Neural Network (MNN) is a lightweight mobile-side deep learning inference engine that focuses on running inference for deep neural network models. MNN covers the optimization, conversion, and inference of deep neural network models. At present, MNN has been used in more than 20 apps, including Mobile Taobao, Mobile Tmall, Youku, Juhuasuan (聚划算), UC, Fliggy, and Qianniu (千牛). In tests with common deep neural network models such as MobileNet V2 and SqueezeNet V1.1, MNN’s CPU and GPU performance on Android (taking Xiaomi 6 as an example) is at least 30% ahead of comparable products in the industry. On iOS (taking iPhone 7 as an example), MNN’s CPU and GPU performance is at least 15% ahead of comparable products in the industry.

Field-Programmable Gate Arrays (FPGA)

The inference framework on Field-Programmable Gate Arrays (FPGAs) was built by the server R&D team. The inference time of ResNet-18 is 0.174 ms per image, which is currently the best performance in the industry. The edge computing product, Alibaba Edge, implements efficient operators in hardware, and its inference speed is twice that of the edge GPU. This solution will later be released as a product.


Falcon_conv

Falcon_conv is a low-precision convolution library developed by the machine intelligence technical team in CUDA C++ that runs on NVIDIA GPUs. It takes two low-precision (int8) tensors as input and outputs the convolution result as float or int32 data. In addition, falcon_conv supports fusing some common operations that follow convolution, such as Scale, BatchNorm, and ReLU. On a single Tesla P4 GPU, we compared the performance of falcon_conv with cuDNN v7.1, NVIDIA’s official computing library, as shown in the figure. In almost all cases, falcon_conv is faster than cuDNN, and in some individual cases it is up to five times faster. The test cases are selected from the convolution configurations that consume the most time in ResNet and VGG.
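The arithmetic behind such a fused low-precision convolution can be illustrated with a small, pure-Python reference. This is only a sketch of the general technique (int8 inputs, wide integer accumulation, then a fused scale and ReLU); it is not the falcon_conv API, and the function name and parameters are hypothetical.

```python
def conv1d_int8_fused(x_q, w_q, x_scale, w_scale, bias=0.0, relu=True):
    """1-D valid convolution on int8 values with a fused scale, bias, and ReLU.

    x_q, w_q: lists of integers in the int8 range [-128, 127].
    x_scale, w_scale: float scales that map the integers back to real values.
    """
    for v in list(x_q) + list(w_q):
        if not -128 <= v <= 127:
            raise ValueError("inputs must fit in int8")
    out = []
    for i in range(len(x_q) - len(w_q) + 1):
        acc = 0  # a GPU kernel would keep this in an int32 accumulator
        for j, w in enumerate(w_q):
            acc += x_q[i + j] * w
        # Fused epilogue: dequantize, add bias, apply ReLU in one pass.
        y = acc * x_scale * w_scale + bias
        out.append(max(y, 0.0) if relu else y)
    return out


if __name__ == "__main__":
    print(conv1d_int8_fused([10, -20, 30, 40], [1, 2], 0.5, 0.5))
```

Fusing the epilogue avoids writing the intermediate int32 result to memory and reading it back, which is typically where much of the saving from operator merging comes from.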


Productization

The third problem in the service support process is integration and productization. Apart from mobile phone scenarios, offline services require additional hardware platforms for support. Previously, we relied mostly on hardware devices provided by third parties, and cost, stability, and scalability were pain points that restricted the expansion of offline projects. To solve these problems, we drew on our previous project experience with hardware devices and developed two common offline productization solutions: the smart box and the integrated camera. Each type of product comes in different models for different scenarios.

Smart Box

The first solution that we provide is the smart box solution. We can simply use the smart box as an edge server that is suitable for small- and medium-scale scenarios. The box itself provides multiple interfaces that can be connected to external sensors, such as USB or IP cameras and voice modules. The box supports direct local deployment and features high data security. We provide two editions (high-end and low-end) of smart boxes based on service characteristics. The high-end edition uses Alibaba’s self-developed edge computing product, Alibaba Edge. In addition to a sound hardware design and an efficient inference framework, the box also contains complete compiler support and is very easy to use. The low-end version is a pure ARM box. The following table lists the performance, cost, and applicable scenarios of both editions of boxes.

Here, we will focus on Alibaba Edge, an edge computing product developed by Alibaba. In addition to its AI computing power of up to 3 TGFLOPS, Alibaba Edge has a significant price advantage over the edge GPU solution. Moreover, Alibaba Edge supports integrated deployment with the cloud. This platform-based product can be launched quickly and supports large-scale O&M.

In the following table, we compare the running time of LRSSD300 + MobileNetV2 on different hardware devices to give you a more intuitive understanding.

Integrated Camera

Another integration solution that we provide is the integrated camera. The integrated camera is especially suitable for the cloud + terminal deployment mode: it completes relatively simple processing on the terminal, while the cloud performs deeper processing on the information that the terminal returns. This saves bandwidth and reduces cloud costs. In addition, the integrated camera is convenient to deploy and has a distinct cost advantage after mass production. Today, the integrated camera serves as an important carrier in the cooperation projects of Alibaba Group.
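As a rough illustration of the cloud + terminal split, the Python sketch below has the terminal keep only frames whose on-device detection score passes a threshold and reports the fraction of uploads avoided. The scores, the threshold, and the function names are hypothetical and do not describe the actual camera firmware.

```python
def select_frames_for_upload(frame_scores, threshold=0.5):
    """Return indices of frames the terminal should send to the cloud."""
    return [i for i, score in enumerate(frame_scores) if score >= threshold]


def upload_fraction_saved(frame_scores, threshold=0.5):
    """Fraction of frames filtered out on the terminal, never uploaded."""
    if not frame_scores:
        return 0.0
    uploaded = len(select_frames_for_upload(frame_scores, threshold))
    return 1.0 - uploaded / len(frame_scores)


if __name__ == "__main__":
    # Hypothetical per-frame confidence from a lightweight on-device detector.
    scores = [0.1, 0.9, 0.3, 0.7]
    print(select_frames_for_upload(scores))
    print(upload_fraction_saved(scores))
```

The heavier models then run in the cloud only on the selected frames, which is where the bandwidth and cloud cost savings come from.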

Business Models

In the past two years, we have tried different business models. Several examples in different forms are cited below:

Cainiao Future Park

In the Cainiao Future Park project, we mainly provided basic vision algorithms, while the Cainiao smart park team was responsible for the R&D of service algorithms and engineering services. After six months of joint effort, we completed multiple functions, including off-duty and sleep detection, fire-fighting channel exception detection, parking space occupancy detection, pedestrian out-of-bounds detection, and entry count detection.

During this cooperation, we found that the high cost of computing units is one of the main factors that limit the wide adoption of these algorithms. To solve this problem, we worked with the server R&D team to develop a customized software and hardware solution. In this solution, the hardware platform is Alibaba Edge, paired with a specially customized efficient model structure and a self-developed fast detection algorithm. The detection accuracy of the new solution is almost lossless, while the inference speed is increased by 4–5 times and the cost is 50% lower than that of the edge GPU solution.

Model Compression Acceleration

We have helped different business units of Alibaba Group quantize, slim down, and accelerate their existing algorithm models. Examples include mobile phone OCR, mobile phone object detection, real-person authentication, and face logon and authentication of Mobile Taobao, Cainiao self-lifting cabinet, face-scanning admission of Alibaba sports events, and Shenzhouying (神州鹰) face recognition cloud album.

Conclusion and Prospects

After two years of effort, the offline intelligence team of Alibaba Machine Intelligence Lab has built up experience across the offline intelligence field:

  • Algorithms: We have accumulated experience in low-bit quantization, pruning, hardware/software co-design, lightweight network design, and on-terminal object detection, with multiple indicators reaching the best in the industry.
  • Engineering: We have developed a set of training tools that feature high flexibility and data security. With the help of our partners, these tools have delivered the best inference performance in the industry on the ARM, FPGA, and GPU platforms.
  • Productization: Together with our partners, we have developed the smart box and integrated camera solutions that are suitable for different service scenarios.

Finally, we are fortunate that we can further polish our technologies in different service scenarios inside and outside the Group.
