Interview with Dr. Yu Kai of AISpeech — The Importance of Naturalness in Natural Language Processing

“The key to the future of natural language technology is just one word — Natural.”

On the last day of November, Dr. Yu Kai, chief scientist and co-founder of AISpeech, made this comment at the seventh class of the Artificial Intelligence Study Club sponsored by Tsinghua’s X-lab. He was delivering a speech to the audience on the future potential of natural language processing (NLP).

In his lecture “Intelligence in Cognitive Verbal Dialogs,” Yu Kai argued that the greatest challenge in cognitive interaction is not speech: speech recognition is a clearly defined problem, and most programs designed for it perform exceptionally well.

Image 1: November 30, AISpeech co-founder Yu Kai gives a lecture entitled “Intelligence in Cognitive Verbal Dialogs” at Tsinghua. Photo credit: Liu Han

He believes that the bigger challenge is the dialog process. For example, in therapy for people with depression, verbal dialogs are more like purposeful chats. Without a strong mathematical background, the participants will find it very difficult to continue the conversation. Only by accumulating more data in a vertical field can such conversations be improved.

Below is the transcript of Dr. Yu Kai’s lecture; we have edited it slightly for clarity and conciseness.

Moving from Vocal Recognition to Vocal Interaction

Why did we start paying attention to intelligence in verbal dialogs?

In the beginning, we had the Windows graphical user interface for interaction between people and information. At that time, we were amazed to see the information neatly printed out. Then, starting in 2011, smartphones, which evolved from cell phones, became increasingly prevalent. From there, we gradually started interacting with machines in natural language, whether textual or verbal. As time went by, we realized that verbal communication needs to be at the core of smart information acquisition in the future. In the era of the mobile Internet, a new mode of communication emerged as the most critical piece of the answer: voice interaction.

Image 2: Lecture venue. Photo credit: Liu Han

Before the beginning of this millennium, when search engines like Google or Baidu had just come into being, the interaction was one-way. However, since the advent of smartphones, interaction has become two-way. For instance, the first-generation iPhone was not capable of voice interaction, but market research then found that 75% of users wanted voice control functions. Therefore, the next two generations added voice control. However, the company was surprised to learn that fewer than 5% of people used this function. Apple concluded that its users wanted more than voice control; they wanted natural language interaction. Hence, Apple launched Siri with the iPhone 4S. Subsequent market research showed that about 87% of users interacted with Siri at least once a month.

However, they also discovered that most of the time these iPhone users were teasing Siri out of fun or curiosity rather than trying to accomplish anything useful. As a result, Apple could not capitalize on this invention. This compelled Apple to acquire VocalIQ, a company that specialized in statistical dialog systems, in 2015. That acquisition gave Apple the power to form a closed loop that unified technical speech recognition and semantics, and to improve Siri with new functions.

Image 3: Lecture venue. Photo credit: Liu Han

We all talk about the Internet age, but to what extent have information systems progressed?

Problems and Opportunities of a Natural Vocal Interaction System

What are the main problems and are there any opportunities?

The second thing is computing power. Speech recognition solutions depend on computing power. To give an example, the app in the demo I just performed used a deep neural network with seven layers of 2,048 nodes each, 1,320 inputs, nearly 10,000 outputs, and a total of about 45 million parameters. When performing speech recognition, we cut each second of speech into 100 frames and extract a 1,320-dimensional feature vector from each frame. Now imagine that I pass a feature vector through the neural network 100 times per second and then need to look up the result in a search network of hundreds of millions of nodes. As you can see, this operation is incredibly complex. Statistics show that, if we divide the speech recognition process into search and neural network forward propagation, in traditional systems forward propagation accounts for 30% to 40% of the time, and search across the various language spaces accounts for the remaining 60% to 70%. Therefore, at the technical level, we need to solve the speed problem.
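
The scale of the network described above can be checked with some quick arithmetic. The sketch below is only an approximation under stated assumptions: it treats the network as fully connected, counts a bias for every node, and rounds the output layer to exactly 10,000 units (the lecture says only "nearly 10,000"). The function name count_parameters is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected feed-forward network."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out  # weight matrix between two layers
        total += fan_out           # one bias per node in the next layer
    return total


# Assumed layout: 1,320 inputs, seven hidden layers of 2,048 nodes,
# and an output layer rounded to 10,000 units.
LAYER_SIZES = [1320] + [2048] * 7 + [10000]
FRAMES_PER_SECOND = 100  # each second of speech is cut into 100 frames

params = count_parameters(LAYER_SIZES)

# One multiply-accumulate per weight, for every frame of speech.
weights = sum(a * b for a, b in zip(LAYER_SIZES, LAYER_SIZES[1:]))
macs_per_second = weights * FRAMES_PER_SECOND

print(f"parameters: {params:,}")
print(f"multiply-accumulates per second of speech: {macs_per_second:,}")
```

Under these assumptions the count lands in the same range as the figure quoted in the lecture, and the forward pass alone costs several billion multiply-accumulate operations for each second of speech, before any search takes place. The exact numbers depend on the true size of the output layer.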

Image 4: Questions from the audience. Photo credit: Liu Han

Another issue with perceptual intelligence is how to make it more lightweight. Overall changes and advances in information technology are undoubtedly related to advances in basic technology. We constantly face new challenges, such as whether we can increase noise-reduction performance to 90% or use large vocabularies on cell phones and watches. As we make various improvements to IoT technology, we can start to overcome these challenges one after another.

Making Interactions Natural through Cognition

Another category is chatting, that is, as long as you keep talking, the machine will continue chatting with you. Microsoft XiaoIce is an example of this type of chat dialog. Its guiding metric is chat duration. A user can chat for several hours, and the conversation will continue. However, there is no set purpose in this chat, so the main thing to consider is how to inject interesting content into the conversation.

However, if the user has a purpose in chatting, the machine will not be able to understand it; yet it will continue to chat as long as the user stays. A chat is characterized by multiple rounds of interaction without any fixed structure. The machine may occasionally add some informative elements, which is what researchers hope to integrate more into the machine, but the interaction is primarily unstructured. Therefore, features like chatting are in fact more about integrating some interesting elements into the machine. To be honest, we still do not have a stable theoretical system for dialogs of this type that would help us solve the associated problems at the theoretical level.

The final type of dialog is a task-oriented multi-round dialog. We have a firm mathematical basis for this type of dialog that allows us to see it as a sequence of decision-making processes.

One Technology, Three Levels

The second level is interactive decision making. This determines how the program will respond to a conversation. For instance, if I say that I am looking for a restaurant, the program should understand where I want to go and what I want to eat.

The third level is evolution. If the program thinks that I want something expensive when in fact, I want something cheap, it should be able to recognize its mistake and update its response strategy in the future, so its cognitive abilities evolve.
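
The second and third levels can be illustrated with a toy restaurant dialog. This is a minimal sketch, not the statistical method used in AISpeech's systems: the slot names, the RestaurantPolicy class, and the simple score update are all hypothetical choices made for illustration. The policy decides what to say next from the dialog state (interactive decision making) and adjusts its default price range when the user rejects a suggestion (evolution).

```python
from dataclasses import dataclass, field

# Hypothetical slots: where the user wants to go and what they want to eat.
REQUIRED_SLOTS = ("area", "food")


@dataclass
class DialogState:
    slots: dict = field(default_factory=dict)


class RestaurantPolicy:
    def __init__(self):
        # Initial belief: the user probably wants something expensive.
        self.price_scores = {"expensive": 0.5, "cheap": 0.0}

    def next_action(self, state):
        """Decide the next system action from the current dialog state."""
        for slot in REQUIRED_SLOTS:
            if slot not in state.slots:
                return ("request", slot)  # ask for the missing information
        best_price = max(self.price_scores, key=self.price_scores.get)
        return ("recommend", best_price)

    def feedback(self, suggested_price, accepted, step=1.0):
        """Raise or lower the score of a price range after user feedback."""
        self.price_scores[suggested_price] += step if accepted else -step


policy = RestaurantPolicy()
state = DialogState()
print(policy.next_action(state))            # the area is still unknown

state.slots["area"] = "city centre"
state.slots["food"] = "noodles"
first = policy.next_action(state)
print(first)

policy.feedback(first[1], accepted=False)   # the user wanted something cheap
second = policy.next_action(state)
print(second)
```

After the rejected suggestion, the same dialog state leads to a different recommendation, which is the kind of strategy update the third level describes. A real system would learn such updates from many dialogs rather than from a single hand-set score.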

Now for something that concerns you: large-scale custom conversational intelligence. Looking at conversational intelligence as a whole, each scenario throughout the process may seem to work well, but things change when it comes to professional scenarios. For example, shopping scenarios differ from financial or home scenarios in the information that a system needs to understand. This means we have to validate whether a dialog model can recognize and support each scenario.

In terms of details, there are many personalized requirements, such as wake-up words. For example, if we say “Alexa, play a song,” the name “Alexa” is a wake-up word. However, sometimes we might want to give the machine a name of our own. In the future, there will be more such requirements for personalization, like using personalized wake-up words.

We hope that our verbal dialog system will support customization. Moreover, large-scale customization is a concept we were the first to propose. In 2013, we launched a platform called “Dialog Workshop.” In 2017, we upgraded this platform to the “Dialog User Interface (DUI),” which features large-scale customization. In essence, it integrates a graphical interface and a speech interface in an interactive dialog framework.

What can custom voice interaction technology do?

Another example is a project we plan to undertake. We want to build an onboard vehicle system that can automatically add and use different voices. If we prefer to listen to the sweet voice of Katy Perry for navigation, we just need to say “Katy Perry.” The system will not mistakenly speak in the husky voice of Michael Caine. If we tell the system to go back to the previous voice, it will switch back to the one used before. We hope that the machine will be capable of switching quickly back and forth. Taking it further, we want to support customization of features related to understanding and dialog.
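
The switching behavior described above could be sketched as follows. This is only an illustration of the bookkeeping involved: the class name VoiceSelector, its methods, and the voice labels are hypothetical and do not come from AISpeech's system, and the sketch says nothing about how the voices themselves are synthesized.

```python
class VoiceSelector:
    """Tracks the active navigation voice and the one used just before it."""

    def __init__(self, voices, default):
        if default not in voices:
            raise ValueError(f"unknown default voice: {default}")
        self.voices = set(voices)
        self.current = default
        self.previous = None

    def select(self, name):
        """Switch to a named voice, remembering the one it replaces."""
        if name not in self.voices:
            raise ValueError(f"unknown voice: {name}")
        if name != self.current:
            self.previous, self.current = self.current, name
        return self.current

    def go_back(self):
        """Return to the previous voice; calling it again toggles back."""
        if self.previous is not None:
            self.previous, self.current = self.current, self.previous
        return self.current


selector = VoiceSelector({"standard", "Katy Perry"}, "standard")
selector.select("Katy Perry")
print(selector.go_back())  # the voice used before the switch
print(selector.go_back())  # and forward again
```

Keeping only the current and previous voice is enough for quick back-and-forth switching; a longer history would need a list instead of a single remembered value.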

In this process, the supporting technology goes beyond traditional voice or dialog interaction and the independent perception and cognitive framework we mentioned previously. At that point, we are going to need new technologies that support large-scale customization. For example, concerning recognition, we need to solve the challenge of self-adaptation.

To be more specific, we need a machine that can adapt itself to the speaker and the scenario, or to a particular subject, and make timely adjustments so that dialogs become more self-adaptive. We cannot achieve such self-adaptation on a large scale without the support of related systems. Along the way, we need to borrow specific technologies and customize models in order to scale out and make headway with personalization features. Many new technologies will emerge in this area, but none of them can last without the support of technical infrastructure.
