Where Does AI Stand Today?

By Jin Rong, VP of Research at Alibaba DAMO Academy

Artificial intelligence (AI) has developed rapidly in recent years. From its infancy to its current prosperity, AI has given rise to many excellent practices. However, it also faces many challenges that must be overcome through technological innovation. In this article, Jin Rong, VP of Research at the Alibaba DAMO Academy, introduces the current applications and practices of AI, innovations in AI, and future avenues of exploration.

This article is divided into four parts:

  • Technical Background
  • Natural Language Processing
  • Speech Technology
  • Vision Technology

Part 1: Technical Background

1.1 Machine Learning

1.2 Deep Learning

1.3 Critical Points in the Development of AI

Part 2: Natural Language Processing

2.1 NLP Models

Given the limitations of traditional methods, deep learning has proven effective in NLP. The most successful type of deep learning model in this field is the deep language model. It differs from traditional methods in that the context of every word is represented by tensors. It can also use bidirectional representations, which means it can draw on both the preceding and the following context when making a prediction. In addition, the deep language model uses the Transformer architecture to better capture the relationships between words.
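
To make the idea of capturing relationships between words more concrete, here is a minimal sketch of scaled dot-product self-attention, the core operation of the Transformer. It is an illustration only: the word vectors are made-up numbers, and the learned query, key, and value projections of a real model are left out as a simplifying assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with identity projections.

    Real Transformers learn separate query/key/value projections; they are
    omitted here (an assumption made to keep the sketch short).
    """
    dim = len(vectors[0])
    outputs, weights = [], []
    for query in vectors:
        # Score the query word against every word, to its left and right alike.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        attn = softmax(scores)
        weights.append(attn)
        # Each output is a weighted mix of all word vectors in the sentence.
        outputs.append([sum(a * v[i] for a, v in zip(attn, vectors))
                        for i in range(dim)])
    return outputs, weights

# Toy 2-dimensional "word vectors" (made-up values).
sentence = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
contextual, attention = self_attention(sentence)
```

Each row of `attention` shows how strongly one word attends to every other word. Because the scores cover the whole sentence, the resulting representation is bidirectional.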

Natural Language Models — Q&A Applications

Traditional Q&A applications use frequently asked question (FAQ) pairs and knowledge-based question answering (KBQA). For example, in the following figure, the Q&A pairs form a database of questions and their answers. This approach is relatively conservative, and the people who compile the questions and answers must have a deep understanding of the relevant domain. In addition, it is difficult to extend to new domains, and cold start is slow. To solve these problems, researchers introduced machine reading comprehension technology, which can automatically find answers to questions in a document. It does this by using the deep language model to convert questions and documents into semantic vectors and then matching them to find answers.
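
The vector-matching step can be sketched as follows. This is a simplified illustration, not the production approach: the `embed` function is a bag-of-words stand-in for a deep language model, and the passages are invented examples.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in for a deep language model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_passage(question, passages):
    """Return the passage whose vector is closest to the question vector."""
    q = embed(question)
    return max(passages, key=lambda p: cosine(q, embed(p)))

# Invented example passages.
passages = [
    "Orders ship within two days of payment.",
    "Refunds are issued to the original payment method.",
    "The campaign discount ends on Sunday.",
]
answer = best_passage("When does the campaign discount end?", passages)
```

A real system would replace `embed` with vectors produced by the language model, so that questions and passages match on meaning rather than on shared words.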

Currently, Q&A applications are widely used in large-scale services such as AlimeBot and Xianyu. Every day, these applications help millions of buyers obtain information about products and campaigns through self-service.

2.2 NLP — Machine Translation

As shown in the following figure, Google Brain’s evaluation report on neural networks shows that phrase-based translation models provide limited performance, whereas translation models based on neural networks have improved significantly. Machine translation is already widely used in Alibaba businesses, for example to translate product information in e-commerce and to provide AI translation in DingTalk. However, DingTalk AI translation still has room to improve, because DingTalk conversations are very casual and do not follow strict grammatical rules.

Part 3: Speech Technology

Generally, speech recognition involves two models: a language model and an acoustic model. The language model predicts the probability of a word or word sequence. The acoustic model predicts the probability of generating the acoustic feature X given the pronunciation of the word W.
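
The two models are usually combined through Bayes' rule. In the standard textbook formulation, with W standing for a candidate word sequence and X for the observed acoustic features, the recognizer searches for the most probable W:

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W)
```

Here P(X | W) is supplied by the acoustic model and P(W) by the language model. The denominator P(X) is dropped because it does not depend on W.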

3.1 Speech Recognition

The traditional hybrid speech recognition system is called GMM-HMM (Gaussian Mixture Model-Hidden Markov Model). The GMM models the distribution of acoustic features, and the HMM models how speech units unfold over time; together they form the acoustic model, which is paired with a separate language model. Despite great efforts in the field, these systems could not match human speech recognition. After 2009, speech recognition systems based on deep learning began to develop. In 2017, Microsoft claimed that its speech recognition system showed significant improvements over traditional systems and even surpassed human speech recognition.

Traditional hybrid speech recognition systems contain independently optimized acoustic models, language models, and linguist-designed pronunciation models. Building such a system is therefore very complicated: multiple components must be developed in parallel, and because each model is optimized independently, the system as a whole ends up poorly optimized.

End-to-End Speech Recognition Systems

Learning from these problems, end-to-end speech recognition systems combine the acoustic model, decoder, language model, and pronunciation model so that they are developed and optimized as a whole. This joint optimization lets such systems achieve better overall performance. Experimental results show that end-to-end speech recognition systems can reduce the recognition error rate by a further 20% or more. In addition, the model shrinks to only a few tenths of the size of a traditional speech recognition model. End-to-end speech recognition systems can also work in the cloud.

3.2 Speech Synthesis

History of Speech Synthesis

Speech synthesis technology originally started with GMMs, transitioned to HMMs in 2000, and then evolved to deep learning models in 2013. In 2016, WaveNet delivered a qualitative improvement in speech quality over previous models. End-to-end speech synthesis models emerged in 2017. In 2018, Alibaba’s Knowledge-aware Neural model not only produced speech with good acoustic quality but also significantly reduced model size and improved computing efficiency, enabling effective real-time speech synthesis.

A major bottleneck in speech synthesis has been the high cost of customization. Traditional voice customization generally requires professional speakers, a recording studio, precise manual labeling, and a large amount of audio data, typically more than an hour. Speech synthesis is now moving toward personalized voice customization. With this feature, ordinary users can complete voice customization by recording on their mobile phones, even in a noisy environment. For example, you could change the voice of your in-car navigation system to the voice of a family member.

3.3 Multimodal Speech Interaction Solutions

For example, if we want to design voice-interaction ticketing kiosks for subway stations, we must account for the fact that the user will be surrounded by many other people, who may also be talking. To solve this problem, we can give the kiosk vision capabilities so that it can identify the user who wishes to buy a ticket based on face size. The following figure shows the main processing flow for separating the speech of the target speaker using facial feature monitoring information. The final step feeds the extracted audiovisual features into a source mask estimation model based on audiovisual fusion.
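
Two pieces of this flow are easy to sketch: choosing the target user by face size, and applying an estimated mask to the mixed audio. The code below is a toy illustration under stated assumptions. The dictionary keys, the spectrogram values, and the mask are all hypothetical, and in a real system the mask would be predicted by the trained audiovisual model.

```python
def pick_target_face(faces):
    """Return the face with the largest bounding-box area.

    Each face is a dict with hypothetical keys "id", "width", and "height".
    The largest face is assumed to belong to the person closest to the kiosk.
    """
    return max(faces, key=lambda f: f["width"] * f["height"])

def apply_mask(mixture, mask):
    """Element-wise product of a mixture spectrogram and a mask in [0, 1]."""
    return [[m * x for m, x in zip(mask_row, mix_row)]
            for mask_row, mix_row in zip(mask, mixture)]

faces = [
    {"id": "bystander", "width": 40, "height": 50},
    {"id": "customer", "width": 120, "height": 150},
]
target = pick_target_face(faces)

# Toy magnitude spectrogram (2 frames x 3 frequency bins) and a mask that a
# trained audiovisual model would normally predict; values are made up.
mixture = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
mask = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
target_speech = apply_mask(mixture, mask)
```

The mask keeps the time-frequency bins dominated by the target speaker and suppresses the rest, which is why the fusion model only needs to estimate the mask and not the speech itself.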

Applications of Audiovisual Fusion Technology

Audiovisual fusion technology is already widely used in many aspects of our daily life. It is used in major transportation hubs in Shanghai, such as subway stations, Hongqiao Railway Station, Shanghai Railway Station, Shanghai South Railway Station, Hongqiao Airport, and Pudong Airport. Since March 2018, more than one million visitors have been served by audiovisual fusion applications. In addition, at the Hangzhou Computing Conference held in September 2018, the smart ordering machine based on multimodal technology, a collaboration of the DAMO Academy and KFC, took 4,500 orders in 3 days. In August 2019, DingTalk launched its new smart office hardware product M 25, which uses multimodal interaction technology for more effective interaction in noisy environments.

Part 4: Vision Technology

4.1 Image Search

Currently, image search applications face three major challenges. First, they must deal with more and more data, with training data sets containing billions of samples. Second, image search must handle hundreds of millions of classes. Third, model complexity is continuously increasing.

To solve these challenges, Alibaba launched Jiuding, a large-scale AI training engine. Jiuding serves as both a large-scale training platform and an expert system, covering vision, NLP, and other fields. It consists of two parts. The first is communication: because all large-scale training requires multi-machine, multi-GPU architectures, finding ways to train models efficiently on such architectures while reducing communication cost is a very important area of research. The second is optimization algorithms, since implementing distributed optimization is also a major challenge. This engine can train classifiers on large volumes of data with strong results. For example, it can complete ImageNet ResNet50 training in 2.8 minutes, and it can complete training for classifying billions of images into hundreds of millions of IDs within seven days.
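
The interplay between communication and optimization can be illustrated with a generic data-parallel training loop. This is a minimal sketch of the general technique, not Jiuding's implementation: the workers run sequentially in one process, `all_reduce_mean` stands in for real network communication, and the model is a one-parameter toy.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the communication step: average across workers."""
    return sum(values) / len(values)

def train(shards, steps=200, lr=0.05):
    """Synchronous data-parallel gradient descent on a single weight."""
    w = 0.0
    for _ in range(steps):
        # Each worker computes a gradient on its own shard (in parallel in practice).
        grads = [local_gradient(w, shard) for shard in shards]
        # After averaging, every worker applies the same update.
        w -= lr * all_reduce_mean(grads)
    return w

# Two "workers", each holding part of a dataset generated from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = train(shards)
```

Every step requires one round of gradient exchange, so at scale the cost of `all_reduce_mean` can dominate the computation. That is why communication and distributed optimization are treated as separate research problems.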

Image Search Applications

Image search is widely used in real-life scenarios. Currently, Pailitao can process ultra-large-scale image recognition and search tasks, including tasks that involve more than 400 million products, more than 3 billion images, and more than 20 million active users. It can identify more than 30 million entities, including SKU commodities, animals, plants, and vehicles.

Tianxun is a recognition and analysis application for remote sensing images. It can carry out large-scale remote sensing image training, and process tasks such as road network extraction, terrain classification, new building identification, and illegal building identification.

4.2 Image Segmentation

In contrast, as shown in the figure below, segmentation technology based on deep learning uses supervised learning with many training samples. Segmentation and classification results are obtained at the same time, so the machine can determine the instance attributes of each pixel. Given a large volume of relevant data, an encoder-decoder model can finely delineate the edges of an object.
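
Obtaining segmentation and classification together amounts to predicting a class for every pixel. The sketch below shows only that final step, assuming the network has already produced per-pixel class scores. The label set and the scores are hypothetical.

```python
def label_map(scores):
    """Assign each pixel the class with the highest score.

    scores[row][col] is a list of per-class scores, as an encoder-decoder
    network might output; the result holds one class index per pixel.
    """
    return [[max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
            for row in scores]

CLASSES = ["background", "product"]  # hypothetical label set
scores = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.3, 0.7], [0.6, 0.4]],
]
labels = label_map(scores)
```

The positions where neighboring labels differ trace the object boundary, which is the edge that the model learns to delineate.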

Image Segmentation Applications

Alibaba applies image segmentation technology to all categories of products in the Taobao ecosystem. With this technology, Alibaba can automatically generate product images on a white background to accelerate product publishing.
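
Once a segmentation mask is available, generating a white-background image is a simple compositing step. The following sketch assumes a binary mask in which 1 marks product pixels. The tiny image and mask are made up for illustration.

```python
WHITE = (255, 255, 255)

def on_white_background(image, mask):
    """Keep pixels where mask is 1 (product) and paint the rest white."""
    return [[pixel if keep else WHITE for pixel, keep in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

# A 2 x 2 RGB image and its product mask (made-up values).
image = [[(10, 20, 30), (200, 10, 10)],
         [(0, 0, 0), (90, 90, 90)]]
mask = [[0, 1],
        [1, 0]]
result = on_white_background(image, mask)
```

In practice the mask would typically be soft near the edges so that the product blends smoothly into the background, but the principle is the same.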

In addition, this technology can be used to mix and match apparel. When merchants provide model images, segmentation technology can be used to dress the models in different clothing.

4.3 Model Compression

Model compression methods have developed over a long period of time. For the model shown in the following figure, unimportant connections can be removed to sparsify the model. Then, the remaining connections are quantized so that their weights take a limited set of values. Finally, the model is pruned to change its structure. By using a field-programmable gate array (FPGA) acceleration solution, we can speed up model inference by a factor of 170 compared to GPUs under the same QPS conditions (ResNet-18 requires only 174 µs).
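
The first two steps, sparsification and quantization, can be sketched on a flat list of weights. This is a generic illustration of magnitude pruning and symmetric uniform quantization, not the specific method used in the FPGA solution. The threshold, bit width, and weights are arbitrary example values.

```python
def prune(weights, threshold):
    """Magnitude pruning: zero out weights whose absolute value is small."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]

def quantize(weights, bits=8):
    """Symmetric uniform quantization to signed integer codes plus one scale."""
    max_abs = max(abs(w) for w in weights)
    q_max = 2 ** (bits - 1) - 1
    scale = max_abs / q_max if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]

weights = [0.82, -0.03, 0.40, 0.01, -0.65, 0.07]
sparse = prune(weights, threshold=0.1)      # small weights become exactly 0.0
codes, scale = quantize(sparse, bits=8)     # integers in [-127, 127]
restored = dequantize(codes, scale)
```

Pruned weights stay at zero after quantization, so the sparsity survives. Each remaining weight is stored as a small integer, which is the kind of representation that hardware such as FPGAs can process efficiently.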

In essence, model compression changes the structure of a model. Selecting a model structure is a difficult problem. It is not a conventional optimization problem, because the candidate structures lie in a discrete space. The cargotainer method proposed by Alibaba can obtain accurate pseudo gradients more quickly. In fact, it won the Low-Power Image Recognition competition held at the 2019 International Conference on Computer Vision.

Applications of Model Compression Technology

The FPGA-based solution is applied in Hema self-service POS terminals, where a computer vision method identifies whether a product has been missed during barcode scanning. This cuts the GPU cost by half. In addition, our proprietary high-efficiency detection algorithm can complete various behavior analysis tasks within 1 second, and the classification accuracy of scanning actions exceeds 90%. The scenario classification accuracy is above 95%.

4.4 Target Detection

Target detection technologies have also been around for a long time. Traditional detection methods, such as histogram of oriented gradients (HoG) and deformable part model (DPM), rely on handcrafted features, that is, features selected manually. Such methods have poor robustness, generalize poorly, and involve highly redundant computation. Many new target detection methods based on deep learning have since been proposed, such as Faster R-CNN, SSD, RetinaNet, and FCOS. Their advantages are that the machine learns the features itself, that they cope better with changes in object size and appearance, and that they generalize well. In fact, from 2008 to 2019, the accuracy of target detection increased from about 20% to about 83%.

4.5 Target Tracking

Target Detection and Tracking Applications

Target tracking is extensively applied in new retail scenarios. Shopping centers and brand stores need in-depth insight into customer traffic and on-site behavior to build data associations between consumers, goods, and venues. This allows retail businesses to improve their offline operational management efficiency, enhance the consumer experience, and ultimately promote business growth.

In addition, target tracking technology is used in crime prevention and investigation. A crime investigation often involves reviewing a large amount of video, which is difficult to do manually. Target detection and tracking technology can condense the relevant information in 24 hours of video into a clip of just a few minutes. People and objects in the video can be identified, and their trajectories can be tracked. This technology also allows you to play back specific trajectories across different time periods. If you are interested in a certain trajectory or a certain type of trajectory, you can choose to play back only video content of this type, greatly reducing the time spent watching videos.
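
Selecting trajectories by type and time period can be sketched as a simple filter over tracking results. The record layout below is hypothetical and the tracks are invented. A real system would take them from the output of the detector and tracker.

```python
def select_tracks(tracks, kind, window):
    """Keep tracks of one kind that overlap the [begin, end] time window."""
    begin, end = window
    return [t for t in tracks
            if t["kind"] == kind and t["start_s"] <= end and t["end_s"] >= begin]

def playback_seconds(tracks):
    """Total footage to review if only the selected tracks are played."""
    return sum(t["end_s"] - t["start_s"] for t in tracks)

# Hypothetical tracking output; times are seconds from the start of the video.
tracks = [
    {"id": 1, "kind": "person", "start_s": 100, "end_s": 160},
    {"id": 2, "kind": "vehicle", "start_s": 120, "end_s": 300},
    {"id": 3, "kind": "person", "start_s": 40_000, "end_s": 40_045},
    {"id": 4, "kind": "person", "start_s": 80_000, "end_s": 80_030},
]

# People seen during the first 12 hours of a 24-hour recording.
morning_people = select_tracks(tracks, "person", window=(0, 43_200))
review_time = playback_seconds(morning_people)
```

In this toy example, the reviewer watches under two minutes of footage instead of twelve hours, which is the kind of reduction the condensed playback aims for.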

Conclusion

Get to know our core technologies and latest product updates from Alibaba’s top senior experts on our Tech Show series
