Have you ever seen a robot talking to a human? As a child I used to see that only in my dreams, but it is no longer a matter of dreams or imagination.

It has become practical, made possible by the technique of speech recognition.

Speech recognition is the ability of a machine to identify words and phrases in spoken language and convert them into a machine-readable form. For speech recognition to work, the software must be sophisticated enough to accept speech clearly. Call routing and voice dialing both fall under the recognition process.

Speech recognition is classified into two categories: speaker dependent and speaker independent.

Speaker-dependent systems are trained by a single person: the person who uses the system. Such systems are very efficient and can support a high command count, but they respond only to the person who trained them.
A speaker-independent system is trained to respond to a given word regardless of who speaks it, so it must handle a large variety of speech patterns. Note that the voice input device is mounted on the controller so that movement commands can be given by voice. When commands are input through a microphone, the analog electrical signals representing the voice are first converted into digital form by an analog-to-digital converter, and these digital signals are then fed to the robot controller. The controller must include a filtering stage to clean up the voice input; to improve accuracy, a conversion and modelling process is applied to form the system's response.
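As a rough sketch of this path from voice signal to robot movement, the following Python fragment simulates the two steps described above: quantizing an analog microphone voltage with an ADC, and dispatching a recognized word to wheel speeds. All names and the command table are hypothetical, chosen only for illustration.

```python
# Sketch of the voice-command path: analog signal -> ADC -> recognized word
# -> robot controller command. Names and values are illustrative assumptions.

def adc_convert(analog_samples, bits=8, v_ref=5.0):
    """Quantize analog voltages in [0, v_ref] into unsigned integer codes."""
    levels = 2 ** bits - 1
    return [round(max(0.0, min(v, v_ref)) / v_ref * levels) for v in analog_samples]

COMMANDS = {            # hypothetical command table
    "forward": (1, 1),  # (left wheel speed, right wheel speed)
    "left":    (0, 1),
    "right":   (1, 0),
    "stop":    (0, 0),
}

def dispatch(word):
    """Map a recognized word to wheel speeds; unknown words halt the robot."""
    return COMMANDS.get(word, (0, 0))
```

In a real system the recognized word would come from the recognition chip rather than a plain string, but the controller-side mapping works the same way.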

System recognition circuit

The circuit can be trained with up to 40 words. Suppose you press '1' to train word number 1: pressing any number turns the red LED off, and the number is shown on the digital display. Next press the '#' button; this signals the chip to listen for the training word and turns the LED back on. Then speak, into the microphone, the word you want the circuit to recognize. When the word is accepted, the LED blinks to show acceptance. Similarly, to enter the third word, press '3' followed by '#'. The circuit listens continuously, and each word that is entered is shown on the display.
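The training sequence above can be modeled as a small state machine. The following is a toy sketch (all class and method names are my own, hypothetical labels), tracking only the pieces the text mentions: the red LED, the display slot, listening mode, and the acceptance blink.

```python
class TrainerCircuit:
    """Toy model of the word-training sequence: digit -> '#' -> spoken word.
    Hypothetical names; for illustration only."""

    def __init__(self, max_words=40):
        self.max_words = max_words
        self.slot = None          # word number currently being trained
        self.red_led = True       # red LED is lit while idle
        self.listening = False
        self.trained = {}

    def press_digit(self, n):
        if 1 <= n <= self.max_words:
            self.slot = n
            self.red_led = False  # pressing a number turns the red LED off
        return self.slot          # the slot number appears on the display

    def press_hash(self):
        if self.slot is not None:
            self.listening = True  # '#' tells the chip to listen for the word
            self.red_led = True    # and turns the LED back on

    def speak(self, word):
        if self.listening:
            self.trained[self.slot] = word
            self.listening = False
            return "blink"        # the LED blinks when the word is accepted
```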

A speech recognition system consists of four parts:

  • Linear separation of the sources
  • Multi-channel post filtering
  • Computation of the missing-feature mask from the post-filtered output
  • Speech recognition using the separated audio
You may now be wondering about the microphone array used in this case. The array is composed of a number of omnidirectional elements mounted on the robot. The sources are detected and localized with an appropriate algorithm.
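As one minimal illustration of how a pair of omnidirectional elements can localize a source, the sketch below estimates the time difference of arrival (TDOA) between two microphones by cross-correlation, using a synthetic click. A real robot array uses more elements and a more robust algorithm; this only shows the core idea.

```python
import numpy as np

def tdoa_samples(sig_a, sig_b):
    """Return the lag (in samples) of sig_b relative to sig_a,
    estimated from the peak of their cross-correlation."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    return int(np.argmax(corr)) - (len(sig_a) - 1)

fs = 16000
t = np.arange(256) / fs
pulse = np.exp(-((t - 0.004) ** 2) / 1e-7)  # synthetic click near sample 64
delay = 5                                    # inter-microphone delay, samples
mic1 = pulse
mic2 = np.roll(pulse, delay)                 # same click, arriving later
```

From the lag and the known spacing between the elements, the direction of the source can then be computed by simple geometry.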


Source separation stage: this stage performs a linear separation based on Geometric Source Separation (GSS). The method can also be modified to obtain faster adaptation and shorter-time-frame estimation.
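The core of the linear separation step can be sketched as follows: the microphone observations x are multiplied by a demixing matrix W to recover source estimates y = W x. In this toy version W is simply the inverse of a known mixing matrix; GSS instead adapts W under geometric constraints derived from the localized source positions.

```python
import numpy as np

# Toy linear separation: y = W @ x. The mixing matrix A stands in for the
# room/array geometry; in GSS, W is adapted rather than computed by inversion.

rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 1000))   # two independent source signals
A = np.array([[1.0, 0.5],
              [0.3, 1.0]])                 # mixing (assumed known here)
x = A @ sources                             # what the microphones observe

W = np.linalg.inv(A)                        # idealized demixing matrix
y = W @ x                                   # separated source estimates

print(np.allclose(y, sources))              # True: sources recovered
```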

Post-filter

The separation performed by GSS is followed by a multichannel post-filter, based on a generalization of beamformer post-filtering to multiple sources. We perform a spectral estimation of the background noise, and the estimated noise is decomposed into stationary and transient components.
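A minimal sketch of the stationary-noise part of this idea is shown below: estimate the noise floor in each frequency band (here, crudely, as the minimum over recent frames) and attenuate bands where the signal barely exceeds it. The function name and the Wiener-style gain rule are my own simplifications; a real multichannel post-filter also models the transient leakage from the other sources.

```python
import numpy as np

def postfilter_gain(power_frames, floor=1e-3):
    """power_frames: (n_frames, n_bins) array of spectral powers.
    Returns per-bin gains for the last frame using a Wiener-style rule
    with a minimum-tracking stationary noise estimate."""
    noise = power_frames.min(axis=0)                  # stationary noise floor
    snr = np.maximum(power_frames[-1] - noise, 0.0) / (noise + 1e-12)
    return np.maximum(snr / (1.0 + snr), floor)       # Wiener gain, floored
```

Bands dominated by noise get a gain near the floor; bands with strong speech energy keep a gain near 1.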

Mask Computation

The multichannel post-filter does more than reduce the amount of noise present at a given time and frequency: it is also used to estimate the missing-feature mask, which indicates how reliable each spectral feature is.
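Building the mask from the post-filter output can be sketched very simply: take the per-cell gain (which already lies in [0, 1]) as a continuous reliability, or threshold it into a hard 0/1 decision. The function name and threshold are illustrative assumptions.

```python
import numpy as np

def feature_mask(gains, threshold=None):
    """gains: (n_frames, n_bins) post-filter gains.
    Returns a continuous mask in [0, 1], or a binary 0/1 mask
    if a threshold is given."""
    g = np.clip(gains, 0.0, 1.0)
    if threshold is None:
        return g                            # continuous reliability
    return (g >= threshold).astype(float)   # discrete 0/1 reliability
```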

Recognition: to recognize the speech we can use any kit based on missing-feature theory. This recognition process uses an acoustic model together with a search algorithm.
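The essence of missing-feature recognition is that unreliable spectral features (mask = 0) are marginalized out of the acoustic score, so noise-dominated bands cannot corrupt it. The sketch below scores a frame against a diagonal-Gaussian class model using only the features flagged reliable; the function name and model are illustrative, not the API of any particular kit.

```python
import numpy as np

def masked_log_likelihood(x, mean, var, mask):
    """Log-likelihood of frame x under a diagonal Gaussian, computed only
    over the features flagged reliable by the binary mask (marginalizing
    out the unreliable ones)."""
    m = mask.astype(bool)
    d = x[m] - mean[m]
    return -0.5 * np.sum(np.log(2 * np.pi * var[m]) + d ** 2 / var[m])
```

Classification then picks the word model with the highest masked score, so a band wrecked by noise simply does not vote.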

The frequency-domain post-filter is based on an optimal estimator. We consider all interference (except the background noise) to be localized sources detected by the localization algorithm, and we assume the leakage between channels is constant. Leakage is caused by localization errors or by differences in the microphones' frequency responses.

Missing Feature Mask

The missing-feature mask is a matrix that represents the reliability of each feature in the time-frequency plane. The reliability may be continuous or discrete, with values ranging from 0 to 1: the more noise present in a frequency band, the lower the post-filter gain for that band. For the recognition circuit itself, the key part is the speech recognition IC. Such chips can recognize a limited set of words within a given time window (on the order of seconds), and the circuit uses static RAM for memory. The chip has two operational modes:

Manual mode and CPU mode. CPU mode is designed to allow the chip to work under a host computer. Usefully, listening and recognition require none of the computer's CPU time.

Manual mode, on the other hand, lets the user build a stand-alone speech recognition board that requires no host computer.

Speech recognition has several applications:

  • Command and control of equipment
  • Telephone assistance systems
  • Data entry

Speech recognition is not about understanding speech. We should not forget that a computer is a machine, and a machine never understands a vocal command; it can only respond to it.

