Speech recognition technology principles and practical system design summary

Speech recognition is based on speech as the research object. Through speech signal processing and pattern recognition, the machine automatically recognizes and understands human spoken language. Speech recognition technology is a high technology that allows machines to transform speech signals into corresponding text or commands through recognition and understanding. Speech recognition is a wide-ranging interdisciplinary subject. It has a very close relationship with acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology. Speech recognition technology is gradually becoming a key technology in computer information processing technology. The application of speech technology has become a competitive emerging high-tech industry.
1. The basic principle of speech recognition
The speech recognition system is essentially a pattern recognition system, including three basic units such as feature extraction, pattern matching, and reference pattern library. Its basic structure is shown in the following figure:
The unknown voice is converted into an electrical signal by a microphone and then added to the input of the recognition system. First, it is preprocessed, and then a voice model is established according to the characteristics of human voice, the input voice signal is analyzed, and the required features are extracted. Create the template required for speech recognition. In the recognition process, the computer should compare the characteristics of the voice template stored in the computer with the input voice signal according to the model of voice recognition, and according to a certain search and matching strategy, find a series of optimal matches with the input voice template. Then according to the definition of this template, the identification result of the computer can be given by looking up the table. Obviously, this optimal result is directly related to the choice of features, the quality of the speech model, and the accuracy of the template.
2. Development history and current status of speech recognition technology
In 1952, Davis et al. Of AT & TBell Lab developed the first person-specific speech enhancement system capable of ten English digits-Audry system. In 1956, Olson and Belar et al. A system of syllable words that uses the spectral parameters obtained by the band-pass filter bank as speech enhancement features. In 1959, Fry and Denes and others tried to build a phoneme to make 4 vowels and 9 consonants, and use spectrum analysis and pattern matching to make decisions. This greatly improves the efficiency and accuracy of speech recognition. Since then, computer speech recognition has attracted the attention of researchers from various countries and began to enter into the research of speech recognition. In the 1960s, the Soviet Union â€™s MaTIn and others proposed endpoint detection of speech end points, which significantly increased the level of speech recognition; Vintsyuk proposed dynamic programming, which was indispensable in future recognition. The important achievements in the late 1960s and early 1970s were the signal linear predictive coding (LPC) technology and dynamic time warping (DTW) technology, which effectively solved the problem of speech signal feature extraction and unequal length speech matching; at the same time, it proposed Vector Quantization (VQ) and Hidden Markov Model (HMM) theory. The combination of speech recognition technology and speech synthesis technology enables people to get rid of the fetters of the keyboard. Instead, it is an easy-to-use, natural, and humanized input method such as voice input. It is gradually becoming the key technology of human-machine interface in information technology.
3. Voice recognition method
At present, the representative speech recognition methods mainly include dynamic time warping technology (DTW), hidden Markov model (HMM), vector quantization (VQ), artificial neural network (ANN), support vector machine (SVM) and other methods.
Dynamic time warping algorithm (Dynamic TIme Warping, DTW) is a simple and effective method in non-specific person speech recognition. This algorithm is based on the idea of â€‹â€‹dynamic programming and solves the problem of template matching with different pronunciation lengths. It is a voice recognition technology. An earlier and more commonly used algorithm appeared. When applying DTW algorithm for speech recognition, it is to compare the pre-processed and framed speech test signal with the reference speech template to obtain the similarity between them, according to a certain distance measurement to get the similarity between the two templates And choose the best path.
Hidden Markov Model (HMM) is a statistical model in speech signal processing, evolved from Markov chain, so it is a statistical recognition method based on a parameter model. Because its pattern library is formed by repeated training, the best model parameter with the highest probability of matching with the training output signal is not the pre-stored pattern sample, and the likelihood probability between the speech sequence to be recognized and the HMM parameter is used in its recognition process The best state sequence corresponding to the maximum value is used as the recognition output, so it is an ideal speech recognition model.
Vector quantization (Vector QuanTIzaTIon) is an important signal compression method. Compared with HMM, vector quantization is mainly suitable for speech recognition of small vocabulary and isolated words. The process is to combine several scalar data of speech signal waveforms or characteristic parameters into a vector for overall quantization in multi-dimensional space. The vector space is divided into several small regions. Each small region finds a representative vector. The vector that falls into the small region during quantization is replaced by this representative vector. The design of the vector quantizer is to train a good codebook from a large number of signal samples, find a good distortion measurement definition formula from the actual effect, design the best vector quantization system, and use the least amount of search and calculation to calculate the distortion Achieve the largest possible average signal-to-noise ratio.
In the actual application process, people have also studied a variety of methods to reduce complexity, including memoryless vector quantization, memory vector quantization and fuzzy vector quantization methods.
Artificial neural network (ANN) is a new speech recognition method proposed in the late 1980s. It is essentially an adaptive nonlinear dynamics system, simulating the principle of human neural activity, with adaptability, parallelism, robustness, fault tolerance and learning characteristics, and its powerful classification capabilities and input-output mapping capabilities Very attractive in speech recognition. The method is an engineering model that simulates the thinking mechanism of the human brain. It is just the opposite of HMM. Its classification decision-making ability and description of uncertain information are recognized worldwide, but its ability to describe dynamic time signals is not yet satisfactory, usually MLP classifier can only solve the problem of static pattern classification, and does not involve the processing of time series. Although scholars have proposed many structures with feedback, they are still insufficient to characterize the dynamic characteristics of time series such as speech signals. Because ANN can not describe the time dynamic characteristics of speech signals well, ANN is often combined with traditional recognition methods to use their respective advantages for speech recognition to overcome the shortcomings of HMM and ANN. In recent years, significant progress has been made in the research of recognition algorithms combining neural networks and hidden Markov models. Its recognition rate is close to that of hidden Markov model recognition systems, which further improves the robustness and accuracy of speech recognition.
Support vector machine (Support vector machine) is a new learning machine model that uses statistical theory. It uses Structural Risk Minimization (SRM) to effectively overcome the shortcomings of traditional empirical risk minimization methods. Taking into account the training error and generalization ability, it has many excellent performances in solving small samples, nonlinear and high-dimensional pattern recognition, and has been widely used in the field of pattern recognition.
4. Classification of speech recognition system
Speech recognition systems can be categorized according to restrictions on input speech. If considering the relevance of the speaker and the recognition system, the recognition system can be divided into three categories: (1) specific person speech recognition system. Only consider the recognition of the voice of the person. (2) Non-specific person voice system. Recognized speech has nothing to do with people. Usually, a large number of different people's speech databases are used to learn the recognition system. (3) Multi-person identification system. Usually can recognize the voice of a group of people, or become a specific group of voice recognition system, the system only requires training of the voice of the group of people to be recognized.
If considering the way of speaking, the recognition systems can also be divided into three categories: (1) Isolated word speech recognition system. The isolated word recognition system requires a pause after entering each word. (2) Conjunction speech recognition system. The connection word input system requires that each word be pronounced clearly, and some consonants begin to appear. (3) Continuous speech recognition system. Continuous voice input is a continuous fluent input of natural fluency, and a large number of legato and accent will appear.
If considering the vocabulary size of the recognition system, the recognition system can also be divided into three categories: (1) small vocabulary speech recognition system. A speech recognition system that usually includes dozens of words. (2) Speech recognition system with medium vocabulary. The recognition system usually includes hundreds of words to thousands of words. (3) Large vocabulary speech recognition system. Speech recognition systems that usually include thousands to tens of thousands of words. As the computing power of computers and digital signal processors and the accuracy of recognition systems improve, the classification of recognition systems based on vocabulary size also continues to change. It is currently a medium vocabulary recognition system, and it may be a small vocabulary speech recognition system in the future. These different limitations also determine the difficulty of speech recognition systems.
5. Application of voice recognition
The fields where speech recognition can be applied are roughly divided into five categories:
Office or business system. Typical applications include: filling in data forms, database management and control, keyboard function enhancement, and so on.
Manufacturing: In quality control, voice recognition systems can provide a "hands-free" and "eyes-free" inspection (component inspection) for the manufacturing process.
Telecommunications: A very wide range of applications are feasible in dial-up telephone systems, including the automation of operator assistance services, international and domestic remote e-commerce, voice call distribution, voice dialing, and classified orders.
Medical: The main application in this area is to generate and edit professional medical reports from sound.
Others: Including games and toys controlled and operated by voice, voice recognition system to help the disabled, voice control of some non-critical functions in vehicle driving, such as vehicle traffic control system, audio system.
In the future, with the miniaturization and even wearability of handheld devices, various smart glasses and watches will emerge one after another. Of course, it is very important to find a breakthrough in the market. Good solutions and system design references are also essential.

Solar Light
Outdoor Solar Lights,Solar Light,Solar Wall Lights,Solar Christmas flame light
XINGYONG XMAS OPTICAL (DONGGUAN ) CO., LTD , https://www.xingyongxmas.com