Interview with Dr. Yu Kai of AISpeech — The Importance of Naturalness in Natural Language Processing
“The key to the future of natural language technology is just one word — Natural.”
On the last day of November, Dr. Yu Kai, chief scientist and co-founder of AISpeech, made this comment at the seventh class of the Artificial Intelligence Study Club sponsored by Tsinghua’s X-lab. He was delivering a speech to the audience on the future potential of natural language processing (NLP).
In his lecture “Intelligence in Cognitive Verbal Dialogs,” Yu Kai said that the greatest challenge in cognitive interaction is not speech. From the perspective of speech recognition, speech is a clearly defined problem, and most programs designed for it perform exceptionally well.
Image 1: November 30, AISpeech co-founder Yu Kai gives a lecture entitled “Intelligence in Cognitive Verbal Dialogs” at Tsinghua. Photo credit: Liu Han
He believes that the bigger challenge is the dialog process. For example, in therapy for people with depression, verbal dialogs are more like purposeful chats. Without a firm mathematical foundation for this kind of dialog, it is very difficult for a system to keep the conversation going. Only by accumulating more data in a vertical field can the conversation be improved.
Below is the transcript of Dr. Yu Kai’s lecture; we have edited it slightly for clarity and conciseness.
Moving from Vocal Recognition to Vocal Interaction
Today, I want to talk about cognitive intelligence dialogs, the operative word being “dialog.” This word not only refers to speech but also to language itself. In my eight years at Tsinghua, studying man-machine interaction, we have gone through several shifts in the ways humans and machines interact.
Why did we start paying attention to intelligence in verbal dialogs?
The first thing we will discuss today is why we started paying attention to intelligence in verbal dialogs.
In the beginning, we had the Windows graphical user interface for interaction between people and information. At that time, we were amazed to see the information neatly printed out. Then, starting in 2011, smartphones, which evolved from cell phones, became increasingly prevalent. After that, we gradually started interacting with machines using natural language, whether textual or verbal. As time went by, we realized that verbal communication needs to be at the core of smart information acquisition in the future. In the era of mobile Internet, a new mode of communication emerged as the most critical piece of the answer: voice interaction.
Image 2: Lecture venue. Photo credit: Liu Han
Before the beginning of this millennium, when search engines like Google or Baidu had just come into being, the interaction was one-way. However, since the advent of smartphones, interaction has become two-way. For instance, the first generation of iPhone was not capable of voice interaction, but then market research found that 75% of users wanted voice control functions. Therefore, the next two generations added voice control. However, the company was surprised to learn that less than 5% of people used this function. Apple concluded that their users wanted more than voice control — they wanted natural language interaction. Hence, Apple launched Siri with the iPhone 4S. Subsequent market research showed that about 87% of users interacted with Siri at least once a month.
However, they also discovered that most of the time these iPhone users were teasing Siri out of fun or curiosity rather than trying to accomplish anything useful. As a result, Apple could not capitalize on this invention. This prompted Apple to acquire VocalIQ, a company that specialized in statistical dialog systems, in 2015. That acquisition allowed Apple to form a closed loop that unified speech recognition and semantics, and to improve Siri with new functions.
Image 3: Lecture venue. Photo credit: Liu Han
We all talk about the Internet age, but to what extent have information systems progressed?
Looking at the statistics, at the end of 2017, the number of IoT smart devices worldwide exceeded the human population for the first time. However, the vast majority of these devices have tiny screens or no screens at all, and users cannot perform complex operations on them. This means that, to access complex abstract information, users can only interact with such devices vocally or via dialog. This is why, starting in 2014, many tech giants began releasing smart speakers. From a technological perspective, this calls for more than a single solution or technology framework. It also involves dialog management, recognition, synthesis, and understanding.
Problems and Opportunities of a Natural Vocal Interaction System
What are the main problems and are there any opportunities?
The first thing is speech recognition. Speech recognition is a cutting-edge perception technology, and most people are already aware of its applications. Businesses and researchers have already solved the main challenges in speech recognition. If I use a comprehensive speech recognition system, it will have no problem recognizing most of what I say, even poetry. However, even if we use deep-learning technology, we cannot avoid occasional speech recognition errors. Our task is to make the program more human so that, when it makes a mistake, it can correct itself in the context of the complete man-machine interaction. This requires the mutual assistance of perception and cognitive technologies.
The second thing is computing power. Speech recognition solutions depend on computing power. To give an example, the app in the demo I just performed used a deep neural network with seven layers of 2,048 nodes each, 1,320 inputs, nearly 10,000 outputs, and a total of about 45 million parameters. When performing speech recognition, we cut each second of speech into 100 frames and extract a 1,320-dimensional feature vector from every frame. Now imagine that I pass a feature vector through the neural network 100 times per second and then need to search for a match in a network of hundreds of millions of nodes. As you can see, this operation is incredibly complex. Statistics show that, if we divide the time spent on speech recognition into search and neural network forward propagation, in traditional systems forward propagation accounts for 30% to 40%, and search across the various language spaces accounts for 60% to 70%. Therefore, at the technical level, we need to solve the speed problem.
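The scale of those figures can be checked with some quick arithmetic. The sketch below assumes a plain fully connected network with bias terms and rounds the output layer up to exactly 10,000 units; the lecture does not give the exact architecture, and the function name is purely illustrative. Under those assumptions the count comes to roughly 48 million parameters, the same order of magnitude as the figure quoted in the lecture, and about 4.8 billion multiply-accumulate operations per second of speech.

```python
def count_dense_parameters(n_inputs, hidden_sizes, n_outputs, with_bias=True):
    """Count the parameters of a fully connected feed-forward network.

    Assumption: every layer is dense; the lecture does not say so explicitly.
    """
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        total += fan_in * fan_out      # weight matrix between two layers
        if with_bias:
            total += fan_out           # one bias per unit in the next layer
    return total


FRAMES_PER_SECOND = 100                # each second of speech is cut into 100 pieces
HIDDEN = [2048] * 7                    # seven layers of 2,048 nodes each
N_OUTPUTS = 10_000                     # assumed upper bound for "nearly 10,000"

total_params = count_dense_parameters(1320, HIDDEN, N_OUTPUTS)
weights_only = count_dense_parameters(1320, HIDDEN, N_OUTPUTS, with_bias=False)

# One multiply-accumulate per weight per frame in a forward pass.
macs_per_second = weights_only * FRAMES_PER_SECOND

print(f"parameters: {total_params:,}")
print(f"multiply-accumulates per second of speech: {macs_per_second:,}")
```

A slightly smaller output layer, which "nearly 10,000" allows for, would bring the total closer to the 45 million quoted.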
Image 4: Questions from the audience. Photo credit: Liu Han
Another issue with perception intelligence is how to make it more lightweight. Overall changes and advances in information technology are undoubtedly related to advances in basic technology. We constantly face new challenges, such as whether or not we can increase noise-reduction performance to 90% or use large vocabularies on cell phones and watches. As we make various improvements to the IoT technology, we can start to overcome these challenges one after another.
Making Interactions Natural through Cognition
Cognition is the most frustrating aspect. The man-machine dialog is not as simple as most people imagine, because there are many forms of dialog, some of which technology can achieve more effectively than others. If we sorted dialogs based on the number of rounds, we could divide them into several categories. First, the shortest dialog form would be a single round. For example, I would say a sentence, and the machine would respond with a phrase, with no specific structural semantics. This is a command-type dialog and is extremely simple. A more complex form of dialog is ‘question and answer.’ Currently, many systems rely on conventional deep learning technologies to solve problems with ‘question and answer’ dialogs. Because the structure of such a dialog is usually a single question followed by a single answer, with only occasional context, this is not a true multi-round dialog.
Another category is chatting, that is, as long as you keep talking, the machine will continue chatting with you. Microsoft XiaoIce is an example of this type of chat dialog. Its guiding metric is chatting time. A user can chat for several hours, and the conversation will continue. However, there is no set purpose in this chat, so the main thing to consider is how to inject interesting content into the conversation.
However, if the user has a purpose in chatting, the machine will not be able to understand it; yet, it will continue to chat as long as the user hangs around. We characterize a chat by multiple rounds of interaction without any sort of structure. The machine may occasionally add some informative elements, which is what researchers hope to integrate more into the machine, but the interaction is primarily unstructured. Therefore, features like chatting are in fact more about integrating some interesting elements into the machine. To be honest, we still do not have a stable theoretical system concerning dialogs of this type that would help us solve associated problems at the theoretical level.
The final type of dialog is a task-oriented multi-round dialog. We have a firm mathematical basis for this type of dialog that allows us to see it as a sequential decision-making process.
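To make the idea of a dialog as a series of decisions concrete, here is a minimal sketch of a slot-filling loop. Everything in it is an assumption made for illustration: the restaurant task, the slot names, and the hand-written policy are hypothetical and are not taken from any AISpeech system. At each round the system updates its state from what the user said and then decides on its next action.

```python
from dataclasses import dataclass, field

# Hypothetical slots that a restaurant-search task needs before it can act.
REQUIRED_SLOTS = ("area", "food")


@dataclass
class DialogState:
    slots: dict = field(default_factory=dict)


def update_state(state, user_slots):
    """Merge the slots understood from the latest user turn into the state."""
    state.slots.update(user_slots)
    return state


def choose_action(state):
    """Rule-based policy: request the first missing slot, otherwise recommend."""
    for slot in REQUIRED_SLOTS:
        if slot not in state.slots:
            return ("request", slot)
    return ("recommend", dict(state.slots))


# Each element stands for the slots already extracted from one user turn.
user_turns = [{}, {"food": "noodles"}, {"area": "downtown"}]

state = DialogState()
actions = []
for understood in user_turns:
    state = update_state(state, understood)
    actions.append(choose_action(state))   # one decision per round

for action in actions:
    print(action)
```

A statistical dialog manager would replace the fixed rules in `choose_action` with a learned policy, but the round-by-round structure of state update followed by decision stays the same.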
One Technology, Three Levels
Looking at the level of cognition, we can split cognitive technology into three levels.
The first is the static level. This determines if the program can understand the natural language of a random statement and map it to the correct meaning.
The second level is interactive decision making. This determines how the program will respond to a conversation. For instance, if I say that I am looking for a restaurant, the program should understand where I want to go and what I want to eat.
The third level is evolution. If the program thinks that I want something expensive when in fact, I want something cheap, it should be able to recognize its mistake and update its response strategy in the future, so its cognitive abilities evolve.
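The third level can be illustrated with a deliberately simple sketch: the system starts with a default guess about price, and every correction from the user shifts its future guesses. The class and its counting rule are assumptions made for this example only; a production system would more likely use a learned policy than a frequency count.

```python
from collections import Counter


class PricePreference:
    """Toy model of a price-range guess that adapts to user corrections."""

    def __init__(self, default="expensive"):
        self.default = default
        self.counts = Counter()

    def guess(self):
        # Use the default until at least one correction has been observed.
        if not self.counts:
            return self.default
        return self.counts.most_common(1)[0][0]

    def feedback(self, corrected_value):
        # A user correction is treated as the learning signal.
        self.counts[corrected_value] += 1


preference = PricePreference()
first_guess = preference.guess()    # the system assumes "expensive"
preference.feedback("cheap")        # the user says they want something cheap
second_guess = preference.guess()   # the next guess reflects the correction

print(first_guess, "->", second_guess)
```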
Now let us talk about something that concerns you: large-scale custom conversational intelligence. Looking at conversational intelligence as a whole, performance in general scenarios may seem great, but things change when it comes to professional scenarios. For example, shopping scenarios differ from financial or home scenarios in the information that a dialog system needs to understand. This means we have to validate whether a dialog model can recognize and support each scenario.
Looking at the details, there are many personalized requirements, such as wake-up words. For example, if we say “Alexa, play a song,” the name “Alexa” is a wake-up word. However, sometimes we might want to give the machine a name of our own. In the future, there will be more such requirements for personalization, like using personalized wake-up words.
We hope that our verbal dialog system will support customization. Moreover, large-scale customization is a new concept that we were the first to propose. In 2013, we launched a platform called “Dialog Workshop.” In 2017, we upgraded this platform to the “Dialog User Interface (DUI),” which features large-scale customization. In essence, it integrates a graphical interface and a speech interface in an interactive dialog framework.
What can custom voice interaction technology do?
Now, you must be curious. What does this customization technology do? For example, when developing real-time and large-vocabulary speech recognition, we can create a function that automatically makes new words recognizable whenever the semantic content changes. If we add the name of a movie star, say “Nicole Kidman,” the system should be able to automatically add it to the word list and recognize it as the actress’s name for subsequent understanding and interaction.
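The bookkeeping side of that idea can be sketched in a few lines. The `DynamicLexicon` class below is hypothetical and only shows how a newly added name becomes available, with its entity type, to later understanding steps. It does not show the harder part, which is updating the recognizer's own vocabulary and language model so that the name is transcribed correctly in the first place.

```python
class DynamicLexicon:
    """Hypothetical word list that can be extended while the system runs."""

    def __init__(self):
        self._entries = {}   # lowercased phrase -> entity type

    def add(self, phrase, entity_type):
        self._entries[phrase.lower()] = entity_type

    def tag(self, utterance):
        """Return (phrase, entity_type) pairs whose phrase occurs in the text."""
        text = utterance.lower()
        return [(phrase, kind) for phrase, kind in self._entries.items()
                if phrase in text]


lexicon = DynamicLexicon()
request = "Play a movie with Nicole Kidman"

before = lexicon.tag(request)             # the name is unknown at this point
lexicon.add("Nicole Kidman", "actress")   # customization step
after = lexicon.tag(request)              # the name is now tagged as an actress

print(before)
print(after)
```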
Another example is a project we plan to undertake. We want to build an onboard vehicle system that can automatically add and use different voices. If we prefer to listen to the sweet voice of Katy Perry for navigation, we just need to say “Katy Perry.” The system will not mistakenly speak in the husky voice of Michael Caine. If we tell the system to go back to the previous voice, it will switch back to the one used before. We hope that the machine will be capable of switching quickly back and forth. Taking it further, we want to support customization of features related to understanding and dialog.
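The "go back to the previous voice" behavior amounts to keeping a history of selections. The following sketch assumes a simple stack of earlier voices; the class name, the method names, and the "default" voice label are invented for illustration and do not describe the planned vehicle system.

```python
class VoiceSelector:
    """Keeps the active voice and a stack of previously used voices."""

    def __init__(self, initial):
        self.current = initial
        self._history = []

    def switch_to(self, voice):
        # Remember the old voice only when the selection actually changes.
        if voice != self.current:
            self._history.append(self.current)
            self.current = voice
        return self.current

    def switch_back(self):
        # Restore the most recent earlier voice, if there is one.
        if self._history:
            self.current = self._history.pop()
        return self.current


selector = VoiceSelector("default")
chosen = selector.switch_to("Katy Perry")
restored = selector.switch_back()
unchanged = selector.switch_back()   # history is empty, so nothing changes

print(chosen, restored, unchanged)
```

With a stack, repeated "go back" requests walk further into the history. If the goal is to toggle between the last two voices instead, `switch_back` could push the current voice before restoring the previous one.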
In this process, the supporting technology goes beyond traditional voice or dialog interaction and the independent perception and cognitive frameworks we mentioned previously. At that point, we are going to need new technologies that support large-scale customization. For example, concerning recognition, we need to solve the challenge of self-adaptation.
To be more specific, we need a machine that can adapt itself to the speaker, the scenario, or a particular subject, and make adjustments in time so that dialogs become more adaptive. We cannot achieve such self-adaptation at scale without the support of related systems. Along the way, we need to borrow specific technologies and customize models to scale out and make headway with personalization features. Many new technologies will emerge in this area, but none of them can last without the support of technical infrastructure.