Spoken Language Processing | InfWeb
Recommended for you
Speech recognition techniques. Upcoming SlideShare.
Like this presentation? Why not share! Deep Learning for Speech Recognitio Group members. Latest publications. A user study to compare two conversational assistants designed for people with hearing impairments. Publishing year: Statistical models of morphology predict eye-tracking measures during visual word recognition.
In the United States, the National Security Agency has made use of a type of speech recognition for keyword spotting since at least Recordings can be indexed and analysts can run queries over the database to find conversations of interest. Some government research programs focused on intelligence applications of speech recognition, e. In the early s, speech recognition was still dominated by traditional approaches such as Hidden Markov Models combined with feedforward artificial neural networks.
The use of deep feedforward non-recurrent networks for acoustic modeling was introduced during later part of by Geoffrey Hinton and his students at University of Toronto and by Li Deng  and colleagues at Microsoft Research, initially in the collaborative work between Microsoft and University of Toronto which was subsequently expanded to include IBM and Google hence "The shared views of four research groups" subtitle in their review paper.
Researchers have begun to use deep learning techniques for language modeling as well. In the long history of speech recognition, both shallow form and deep form e. Most speech recognition researchers who understood such barriers hence subsequently moved away from neural nets to pursue generative modeling approaches until the recent resurgence of deep learning starting around — that had overcome all these difficulties.
Hinton et al. By early s speech recognition, also called voice recognition    was clearly differentiated from sp eaker recognition, and speaker independence was considered a major breakthrough. Until then, systems required a "training" period.
A ad for a doll had carried the tagline "Finally, the doll that understands you. In , Microsoft researchers reached a historical human parity milestone of transcribing conversational telephony speech on the widely benchmarked Switchboard task. Multiple deep learning models were used to optimize speech recognition accuracy. The speech recognition word error rate was reported to be as low as 4 professional human transcribers working together on the same benchmark, which was funded by IBM Watson speech team on the same task. Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms.
Hidden Markov models HMMs are widely used in many systems.
- Office Hours!
- Speech processing.
Language modeling is also used in many other natural language processing applications such as document classification or statistical machine translation. Modern general-purpose speech recognition systems are based on Hidden Markov Models. These are statistical models that output a sequence of symbols or quantities. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary signal or a short-time stationary signal.
In a short time-scale e. Speech can be thought of as a Markov model for many stochastic purposes. Another reason why HMMs are popular is because they can be trained automatically and are simple and computationally feasible to use. In speech recognition, the hidden Markov model would output a sequence of n -dimensional real-valued vectors with n being a small integer, such as 10 , outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform , then taking the first most significant coefficients.
The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians, which will give a likelihood for each observed vector. Each word, or for more general speech recognition systems , each phoneme , will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
Described above are the core elements of the most common, HMM-based approach to speech recognition.
Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes so phonemes with different left and right context have different realizations as HMM states ; it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization VTLN for male-female normalization and maximum likelihood linear regression MLLR for more general speaker adaptation.
The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis HLDA ; or might skip the delta and delta-delta coefficients and use splicing and an LDA -based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semi-tied co variance transform also known as maximum likelihood linear transform , or MLLT. Many systems use so-called discriminative training techniques that dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data.
Spoken Language Processing
Decoding of the speech the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model, which includes both the acoustic and language model information, and combining it statically beforehand the finite state transducer , or FST, approach. A possible improvement to decoding is to keep a set of good candidates instead of just keeping the best candidate, and to use a better scoring function re scoring to rate these good candidates so that we may pick the best one according to this refined score.
The set of candidates can be kept either as a list the N-best list approach or as a subset of the models a lattice. Re scoring is usually done by trying to minimize the Bayes risk  or an approximation thereof : Instead of taking the source sentence with maximal probability, we try to take the sentence that minimizes the expectancy of a given loss function with regards to all possible transcriptions i. The loss function is usually the Levenshtein distance , though it can be different distances for specific tasks; the set of possible transcriptions is, of course, pruned to maintain tractability.
Efficient algorithms have been devised to re score lattices represented as weighted finite state transducers with edit distances represented themselves as a finite state transducer verifying certain assumptions. Dynamic time warping is an approach that was historically used for speech recognition but has now largely been displaced by the more successful HMM-based approach. Dynamic time warping is an algorithm for measuring similarity between two sequences that may vary in time or speed. For instance, similarities in walking patterns would be detected, even if in one video the person was walking slowly and if in another he or she were walking more quickly, or even if there were accelerations and deceleration during the course of one observation.
A well-known application has been automatic speech recognition, to cope with different speaking speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences e. That is, the sequences are "warped" non-linearly to match each other. This sequence alignment method is often used in the context of hidden Markov models. Neural networks emerged as an attractive acoustic modeling approach in ASR in the late s. Since then, neural networks have been used in many aspects of speech recognition such as phoneme classification,  isolated word recognition,  audiovisual speech recognition, audiovisual speaker recognition and speaker adaptation.
Neural networks make fewer explicit assumptions about feature statistical properties than HMMs and have several qualities making them attractive recognition models for speech recognition. When used to estimate the probabilities of a speech feature segment, neural networks allow discriminative training in a natural and efficient manner. However, in spite of their effectiveness in classifying short-time units such as individual phonemes and isolated words,  early neural networks were rarely successful for continuous recognition tasks because of their limited ability to model temporal dependencies.
One approach to this limitation was to use neural networks as a pre-processing, feature transformation or dimensionality reduction,  step prior to HMM based recognition. Deep Neural Networks and Denoising Autoencoders  are also under investigation. A deep feedforward neural network DNN is an artificial neural network with multiple hidden layers of units between the input and output layers.
DNN architectures generate compositional models, where extra layers enable composition of features from lower layers, giving a huge learning capacity and thus the potential of modeling complex patterns of speech data.
Related Modern Methods of Speech Processing
Copyright 2019 - All Right Reserved