Interspeech 2017 | Far-field Speech Recognition Technology

Since October 25, 2017, the Alibaba iDST Voice Team and the Alibaba Cloud Community have been holding a series of information-sharing meetings on voice technology, in an effort to share the technological progress reported at Interspeech 2017.

Let us now take a look at the topic discussed in this session: far-field speech recognition technology.

1. Introduction to Far-field Speech Recognition Technology

1.1. What is far-field speech recognition?

1.2. Modules of Far-field Speech Recognition System

1.2.1. Front-end Signal Processing Module

Beamforming is another typical front-end signal processing method. It estimates the direction of arrival (DOA) of the sound source by comparing the times at which the sound arrives at different microphones, given the known distances between the microphones. Once the position of the target sound source is determined, audio signal processing methods such as spatial filtering can be used to reduce noise interference and improve signal quality. Typical beamforming methods include delay-and-sum (DS) and minimum variance distortionless response (MVDR).
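
As a rough illustration of delay-and-sum, the sketch below aligns the channels of a linear array by whole-sample delays and averages them. The helper names, the 343 m/s speed of sound, the far-field (plane wave) assumption and the toy signal are all assumptions made for this example; a practical beamformer would also handle fractional delays.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air (assumption)


def steering_delays(mic_positions_m, doa_deg, sample_rate_hz):
    """Integer sample delays for a linear array and a far-field source.

    doa_deg is measured from the array axis (90 degrees = broadside).
    Delays are relative to the microphone that the wavefront reaches first.
    """
    cos_doa = math.cos(math.radians(doa_deg))
    # A microphone further along the axis toward the source hears the sound earlier.
    arrival_s = [-x * cos_doa / SPEED_OF_SOUND for x in mic_positions_m]
    earliest = min(arrival_s)
    return [round((t - earliest) * sample_rate_hz) for t in arrival_s]


def delay_and_sum(channels, delays):
    """Advance each channel by its delay, then average the aligned channels."""
    num_samples = len(channels[0])
    output = [0.0] * num_samples
    for channel, delay in zip(channels, delays):
        for t in range(num_samples - delay):
            output[t] += channel[t + delay]
    return [value / len(channels) for value in output]


# Toy two-microphone recording: mic_b hears the same signal two samples later.
source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
mic_a = source
mic_b = [0.0, 0.0] + source[:-2]
aligned = delay_and_sum([mic_a, mic_b], [0, 2])
print(aligned[:6])  # the first six samples match the source
```

Sound from the steered direction adds up coherently after alignment, while sound from other directions is averaged with mismatched delays and is therefore attenuated.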

Recent years have seen rapid development of speech enhancement technology based on deep neural networks (NN). For NN-based speech enhancement, the input is usually noisy speech, and the network's powerful nonlinear modelling capability is expected to produce “cleansed” speech at the output. Representative methods include feature mapping (Xu, 2015) and the ideal ratio mask (Wang, 2016).
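
To make the ideal ratio mask more concrete, here is a minimal sketch of how such a mask could be computed as a training target when the speech and noise power spectra are known separately (as they are for simulated mixtures). The function names and the exponent `beta=0.5` are assumptions for illustration, not details taken from the cited work.

```python
def ideal_ratio_mask(speech_power, noise_power, beta=0.5):
    """Per-bin mask (S / (S + N)) ** beta from speech and noise power spectra.

    Both inputs are lists of frames; each frame is a list of per-frequency powers.
    """
    mask = []
    for speech_frame, noise_frame in zip(speech_power, noise_power):
        row = []
        for s, n in zip(speech_frame, noise_frame):
            total = s + n
            row.append((s / total) ** beta if total > 0 else 0.0)
        mask.append(row)
    return mask


def apply_mask(noisy_magnitude, mask):
    """Scale each time-frequency bin of the noisy magnitude by its mask value."""
    return [[m * g for m, g in zip(mag_frame, mask_frame)]
            for mag_frame, mask_frame in zip(noisy_magnitude, mask)]


# One frame with two frequency bins (made-up values).
speech = [[9.0, 0.0]]
noise = [[16.0, 4.0]]
mask = ideal_ratio_mask(speech, noise)
enhanced = apply_mask([[5.0, 2.0]], mask)
```

In this setting the network is trained to predict the mask from noisy features, and the predicted mask is then applied to the noisy spectrum at test time, when the clean speech is not available.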

1.2.2. Back-end Speech Recognition Module

CNN technology has been used for speech recognition since 2012–2013. Back then, convolution layers and pooling layers were alternated, and the convolution kernels were large while the networks had few layers. The objective was to further process the features before classification by a DNN. Things have changed with the evolution of CNN technology in the imaging field. It has been found that deeper and better CNN models can be trained when several convolution layers are stacked before each pooling layer and the size of the convolution kernel is reduced. This approach has been applied and refined according to the characteristics of speech recognition.
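
One common way to see why small stacked kernels are attractive is to compare receptive field and weight count. The short calculation below is only a sketch: it assumes stride-1 convolutions, ignores biases, and uses a hypothetical width of 64 channels per layer.

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(num_layers, kernel_size, channels):
    """Weight count (biases ignored) with `channels` input and output maps per layer."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # hypothetical layer width
stacked = conv_weights(3, 3, channels)   # three stacked 3x3 layers
single = conv_weights(1, 7, channels)    # one 7x7 layer

print(receptive_field(3, 3), receptive_field(1, 7))  # both cover 7 positions
print(stacked, single)
```

Under these assumptions, three 3x3 layers see the same 7x7 region as a single 7x7 layer while using fewer weights and applying more nonlinearities in between.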

The LSTM (long short-term memory) model is a special type of recurrent neural network (RNN). Speech recognition is in essence a sequence modelling problem over time, so RNNs are well suited to it. A simple RNN is difficult to train because it suffers from exploding and vanishing gradients. An LSTM model controls the flow of information with input, output and forget gates and with a memory cell that retains both long-term and short-term information, which mitigates the exploding and vanishing gradient problems to some extent. Its shortcoming is that the computation is much more complex than for a DNN, and parallel processing is difficult because of the recurrent connections.
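
The gating described above can be sketched with a single-unit LSTM in plain Python. This is the standard textbook formulation without peephole connections; the parameter names and values are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM; p maps parameter names to floats."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g   # memory cell: keep part of the old state, add new input
    h = o * math.tanh(c)     # output exposed to the next layer and next time step
    return h, c


# Hypothetical parameter values, not trained weights.
names = [prefix + gate for gate in "ifog" for prefix in ("w_", "u_", "b_")]
params = {name: 0.5 for name in names}

h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, params)  # each step needs h and c from the previous one
```

The loop also shows why parallel processing is hard: step t cannot start until step t-1 has produced its `h` and `c`.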

Compared with the LSTM model, the BLSTM (bidirectional LSTM) model has greater modelling capability because it takes into account reverse-time information, that is, the impact of the “future” on the “present”, which is important in speech recognition. However, this capability increases the computational complexity and requires training on complete sentences, which increases GPU memory consumption, reduces the degree of parallelism, and therefore slows model training. It also makes real-time use difficult in actual applications. We use a latency-controlled BLSTM model to overcome these challenges, and we have released the first BLSTM-DNN hybrid speech recognition system in the industry.
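
The text does not spell out how latency control works. One common formulation, assumed here, splits each utterance into fixed-size chunks and gives the backward direction only a few lookahead frames beyond each chunk, so recognition no longer has to wait for the end of the sentence. The function name and the chunk sizes below are hypothetical.

```python
def latency_controlled_chunks(frames, chunk_size, right_context):
    """Split frames into chunks, each paired with a short lookahead window.

    The frames in the chunk produce outputs; the lookahead frames only
    provide limited future context for the backward direction.
    """
    chunks = []
    for start in range(0, len(frames), chunk_size):
        end = start + chunk_size
        current = frames[start:end]
        lookahead = frames[end:end + right_context]
        chunks.append((current, lookahead))
    return chunks


frames = list(range(10))  # stand-in for ten feature frames
chunks = latency_controlled_chunks(frames, chunk_size=4, right_context=2)
for current, lookahead in chunks:
    print(current, lookahead)
```

With this scheme the added latency is bounded by roughly the chunk size plus the lookahead, rather than by the sentence length.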

2. Introduction to Papers on Far-field Speech Recognition Systems in Interspeech 2017

2.1. Residual LSTM: Design of a Deep Recurrent Architecture for Distant Speech Recognition

Compared with the traditional LSTM and highway LSTM structures, the modified network has the following three advantages:

  1. Fewer network parameters (a decrease of 10% with the network configuration in the paper);
  2. Easier training, thanks to two advantages of the residual structure: it avoids excessive processing of data through nonlinear transformations during the forward pass, and it restrains vanishing gradients through a direct path during error backpropagation;
  3. Significant improvement in the final recognition accuracy, and the degradation problem is eliminated when the network depth increases to 10 layers.

The experiment was conducted on AMI, an open far-field dataset. This dataset simulates a meeting scenario, and the data comprises far-field recordings and the corresponding close-talking recordings. Tests were conducted on two datasets, with and without overlapping speech interference, and the results were consistent with the advantages discussed above.
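
The "direct path" in the list above can be illustrated with a generic shortcut connection. The sketch below is not the paper's exact Residual LSTM formulation; it only shows, with hypothetical helper names, how adding a layer's input to its output lets information pass through a deep stack even when the layers themselves contribute nothing.

```python
def plain_stack(x, layers):
    """Apply each layer in turn: x <- layer(x)."""
    for layer in layers:
        x = layer(x)
    return x


def residual_stack(x, layers):
    """Apply each layer with a shortcut: x <- x + layer(x)."""
    for layer in layers:
        transformed = layer(x)
        x = [a + b for a, b in zip(x, transformed)]
    return x


def dead_layer(vector):
    """Stand-in for a layer whose output has collapsed to zero."""
    return [0.0 for _ in vector]


features = [0.2, -1.0, 0.7]
print(plain_stack(features, [dead_layer] * 10))     # the input is lost
print(residual_stack(features, [dead_layer] * 10))  # the input survives unchanged
```

The same shortcut carries gradients backward during training, which is the intuition behind the easier optimization of deeper networks.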

2.2. Generation of Large-scale Simulated Utterances in Virtual Rooms to Train Deep-neural Networks for Far-field Speech Recognition in Google Home

The room impulse response can be generated with the image method. The number of noise sources was randomly selected between 0 and 3. The signal-to-noise ratio of the simulated far-field data was 0–30 dB. The distance between the target speaker and the microphone array was 1–10 m.
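
The simulation recipe above can be sketched as: sample a room configuration, convolve clean speech with a room impulse response, and add noise scaled to the target signal-to-noise ratio. Everything concrete below is an assumption for illustration: the function names, the uniform sampling, the toy impulse response (a stand-in for one produced by the image method) and the synthetic signals.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(samples):
    return sum(x * x for x in samples) / len(samples)


def noise_gain_for_snr(speech, noise, snr_db):
    """Gain that scales the noise so that speech power / noise power hits snr_db."""
    return math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))


def simulate_far_field(clean, impulse_response, noise, snr_db):
    """Reverberate the clean speech, then add noise at the requested SNR."""
    reverberant = convolve(clean, impulse_response)[:len(clean)]
    gain = noise_gain_for_snr(reverberant, noise[:len(clean)], snr_db)
    return [r + gain * n for r, n in zip(reverberant, noise)]


def sample_room_config(rng):
    """Draw one configuration from the ranges quoted in the text (uniform is assumed)."""
    return {
        "num_noise_sources": rng.randint(0, 3),
        "snr_db": rng.uniform(0.0, 30.0),
        "distance_m": rng.uniform(1.0, 10.0),
    }


rng = random.Random(0)
config = sample_room_config(rng)
clean = [math.sin(0.3 * t) for t in range(200)]    # stand-in for clean speech
toy_rir = [1.0, 0.0, 0.5, 0.0, 0.25]               # hypothetical impulse response
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]  # stand-in for one noise source
far_field = simulate_far_field(clean, toy_rir, noise, config["snr_db"])
```

A fuller pipeline would generate a separate impulse response for each noise source and each microphone, and would skip the noise mixing step when the sampled number of noise sources is zero.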

The fCLP-LDNN model was used for acoustic modelling. The model structure and final results are shown in the diagram below. In the presence of noise and interference, the acoustic model trained on simulated far-field data was much more robust than the model trained on near-field “clean” data, and the word error rate decreased by 40%. The data training method described in the paper was used to train the models in Google Home products.

Figure: fCLP-LDNN model structure and final results.
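
For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below uses made-up sentences and made-up error rates. The text does not say whether the 40% is a relative or an absolute reduction, so the `relative_reduction` helper only shows how a relative figure would be computed.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1       # reference word missing from the output
            insertion = current[j - 1] + 1   # extra word in the output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(reference)


def relative_reduction(baseline_wer, improved_wer):
    return (baseline_wer - improved_wer) / baseline_wer


reference = "turn on the light".split()
hypothesis = "turn the lights".split()
wer = word_error_rate(reference, hypothesis)  # one deletion and one substitution

# Hypothetical error rates, used only to show the arithmetic.
reduction = relative_reduction(0.10, 0.06)
```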

3. Conclusion and Technical Outlook

We expect that far-field speech recognition technology will become more sophisticated and easier to use through the joint efforts of the academic and industrial communities.

