How does voice recognition work?
Alexa Please Read…
Thanks to the digital voice assistants that are getting immensely popular and changing the way we search, shop and access online services, voice recognition technology is on the rise.
Studies have projected that as much as 50% of online access would shift to voice by 2020, which is just around the corner. I am sure you will agree, and why wouldn’t you? After all, it’s not hard to find Siri or Alexa hard at work in many of our homes. They understand what is being said, work out the task expected of them and act on it promptly, be it creating shopping lists, booking a table at a restaurant, playing music, reading text back, giving the latest news updates or adjusting the lighting in our homes. This is the power of voice recognition technology.
What is this voice recognition technology?
Plainly speaking, voice recognition is an alternative to typing on a keyboard. You talk to a device that captures the sound waves in the air and translates them into a digital representation that a computer or smart device can understand.
By measuring the sounds a user makes while speaking, voice recognition software can measure the unique biological factors that, when combined, produce the voice. In this biometric sense, voice recognition, which relies on what is commonly called a voiceprint, is the process of identifying and authenticating a speaker.
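To make the idea of a "digital representation" concrete, here is a minimal sketch of sampling and quantization, the two steps that turn a continuous sound wave into numbers. The function names (`digitize`, `tone`), the 8 kHz sample rate and the 16-bit depth are assumptions chosen for illustration, not details taken from any particular device.

```python
import math

def digitize(signal, duration_s, sample_rate_hz=8000, bit_depth=16):
    """Sample a continuous signal and quantize each sample to a signed integer."""
    max_level = 2 ** (bit_depth - 1) - 1          # 32767 for 16-bit audio
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                     # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t))) # clip to the valid range
        samples.append(round(amplitude * max_level))
    return samples

def tone(t, freq_hz=440.0):
    """Hypothetical stand-in for a sound wave: a pure 440 Hz tone."""
    return math.sin(2 * math.pi * freq_hz * t)

# 10 ms of sound becomes 80 integers at 8000 samples per second.
pcm = digitize(tone, duration_s=0.01)
```

Real speech is far richer than a single tone, but the principle is the same: the device measures the wave at regular intervals and stores each measurement as an integer.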
How does this work?
It begins by recording samples of a person’s speech and digitizing them to create a unique voice print, or template. The speech is broken down into words, tones and mannerisms, including specific utterances. Voice is treated here as a behavioral biometric, in the same family as dynamic signature, gait and keystroke recognition. The measurements are mapped and a unique voice print is created.
The voice print has two components: a physiological one and a behavioral one.
The physiological component, as the name suggests, is the physical form of the voice, that is, the shape of the vocal tract: the larynx, nose and mouth. Biometric technology uses the waveform of the voice sample to digitally model the shape of a person’s vocal tract. Every vocal tract is different, and thus the voiceprint is unique too.
The behavioral component covers the movement of the physiological form, such as a person’s jaw, tongue and larynx. Any variation in movement shows up as a change in the manner in which a word is uttered, the pace at which it is said and the way it is pronounced, which includes individual accent, tone, pitch, speed, etc.
With this, one can apply the mapped voice print to recognize the voice or speaker, and also to decipher speech, that is, the words or content, which is what most digital voice assistants do. But for authentication and identity tracking, as used by law enforcement agencies and other organizations for the security and safety of data, one needs to understand how voices are represented and matched.
Representation and Matching of Voices
Have you noticed a small pin-hole on the back of your phone? It is a microphone used for noise cancellation. It collects background noise and generates inverse sound waves that help cancel noise.
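The inverse-wave idea can be sketched in a few lines. This is an idealized illustration that assumes the second microphone captures exactly the noise that reaches the main microphone; in practice the two signals differ, and adaptive filtering is typically needed. The signals and the `cancel_noise` helper are hypothetical.

```python
import math

def cancel_noise(primary, reference):
    """Add the inverted reference (noise) signal to the primary signal."""
    return [p + (-r) for p, r in zip(primary, reference)]

rate = 8000  # samples per second (assumed)
speech = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(160)]
noise = [0.3 * math.sin(2 * math.pi * 50 * n / rate + 1.0) for n in range(160)]

# The main microphone hears speech plus noise; the pinhole microphone hears the noise.
primary = [s + x for s, x in zip(speech, noise)]
cleaned = cancel_noise(primary, noise)

# Largest difference between the cleaned signal and the original speech.
residual = max(abs(c - s) for c, s in zip(cleaned, speech))
```

In this idealized case the residual is limited only by floating-point rounding, which is why the inverted wave "cancels" the noise.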
Noise cancellation is an important factor in processing and analyzing a voice for speaker identification. Voice recognition must analyze the unique characteristics of each voice and then either compare them to the master voice print of a specific enrolled identity (verification) or compare them across the voice samples in a database when an unknown speaker is being identified.
Changes in behavioral components, such as a person’s mood, stress, or an illness like a cold or cough, have an impact on the voice; thus two different methods of recognition are used.
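The difference between verification (one-to-one) and identification (one-to-many) can be sketched as follows. This is a simplified illustration that assumes a voice print is a short list of numbers and that cosine similarity with a fixed threshold is the matching rule; the vectors, names and the 0.9 threshold are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two feature vectors: 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(sample, enrolled_print, threshold=0.9):
    """One-to-one check: does the sample match the claimed identity?"""
    return cosine_similarity(sample, enrolled_print) >= threshold

def identify(sample, database, threshold=0.9):
    """One-to-many search: return the best match above the threshold, or None."""
    best_name, best_score = None, threshold
    for name, voice_print in database.items():
        score = cosine_similarity(sample, voice_print)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical enrolled voice prints.
database = {"alice": [0.9, 0.1, 0.4], "bob": [0.2, 0.8, 0.5]}
sample = [0.88, 0.12, 0.42]

claimed_alice = verify(sample, database["alice"])
claimed_bob = verify(sample, database["bob"])
who = identify(sample, database)
stranger = identify([0.0, 0.0, 1.0], database)
```

Verification needs a single comparison, while identification scales with the size of the database, which is one reason the two tasks are usually treated separately.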
Text-dependent system (constrained manner)
In this system of analysis, an individual has to read or utter a fixed phrase that is preprogrammed in the system. The system then compares the uttered words against the master voice print and assigns a score for how closely they match. This system requires less data and a fixed, predetermined utterance, which improves authentication performance, especially if users cooperate and are willing to read and reread.
Text-independent system (unconstrained manner)
In this system a longer speech input is used that is undetermined and free. The system digitizes the voice as a voice print that captures speech mannerisms and other behavioral traits across a wide spectrum.
This system requires more data and takes longer to process, but it can enroll users passively, without requesting specific utterances.
Most call centers use both systems: a text-dependent system better suits an app or website access tool, making access easier and more convenient, while a text-independent system is better for security against fraud due to its increased accuracy.
Analyzing Voices as Biometrics
The recognition and matching of voice samples comes after the core process of analysis, which truly determines the outcome. Voice or speech samples are mapped as waveforms, with loudness on the vertical axis and time on the horizontal axis. The speech samples are digitized from their analog format, then the features of the person’s voice are extracted and a voice model is created. These voice models express the underlying variations and temporal changes in a voice over a period of time; they include dynamics like the quality, duration, intensity and pitch of the vocal signal. The voice models are then used to compare the similarities and differences between the input voice and the stored voice “states” to produce a recognition decision.
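Two of the features named above, intensity and pitch, can be estimated from a digitized waveform with very little code. The sketch below is illustrative only: it assumes mean squared amplitude as the intensity measure and a basic autocorrelation search for pitch, whereas production systems use more robust features. The function names and the 50 to 400 Hz search range are assumptions.

```python
import math

def frame_energy(frame):
    """Intensity of a frame, measured as mean squared amplitude."""
    return sum(x * x for x in frame) / len(frame)

def estimate_pitch(frame, sample_rate_hz, min_hz=50, max_hz=400):
    """Estimate pitch by finding the lag at which the frame best matches itself."""
    min_lag = sample_rate_hz // max_hz
    max_lag = min(sample_rate_hz // min_hz, len(frame) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate_hz / best_lag if best_lag else 0.0

rate = 8000  # samples per second (assumed)
# Hypothetical 50 ms frame containing a 200 Hz tone in place of real speech.
frame = [math.sin(2 * math.pi * 200 * n / rate) for n in range(400)]

energy = frame_energy(frame)
pitch = estimate_pitch(frame, rate)
```

Computed frame by frame over an utterance, values like these form the sequence of measurements from which a voice model is built.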
Hidden Markov Models (HMMs), a statistical representation of the sounds produced by a person, are mostly used in text-dependent verification systems. However, the Gaussian Mixture Model (GMM), a ‘state-mapping’ model closely related to the HMM, is often used for unconstrained, “text-independent” applications.
New research in voice recognition technology has introduced the concept of an end-to-end neural speaker recognition system that works in both text-dependent and text-independent conditions. Here the same system can be used to recognize both the speaker and the speech or text. There is much happening in this space, and the work to curtail the disadvantages and find new applications for voice and speech recognition systems is loud and clear.
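As a rough sketch of how a Gaussian Mixture Model scores a voice, the code below computes the average log-likelihood of a few feature frames under two speaker models and picks the better fit. Everything here is assumed for illustration: the models are hand-written rather than trained, each frame holds just two made-up features (pitch in Hz and an intensity value), and covariances are diagonal.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood; gmm is a list of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian(x, m, v) for w, m, v in gmm]
        peak = max(comps)  # log-sum-exp trick avoids numerical underflow
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)

# Hypothetical two-component models for two enrolled speakers.
speaker_a = [(0.6, [120.0, 0.5], [100.0, 0.01]), (0.4, [140.0, 0.7], [150.0, 0.02])]
speaker_b = [(0.5, [210.0, 0.3], [120.0, 0.01]), (0.5, [230.0, 0.4], [200.0, 0.02])]

# Hypothetical frames from an unknown utterance.
frames = [[125.0, 0.55], [118.0, 0.48], [138.0, 0.66]]

score_a = gmm_log_likelihood(frames, speaker_a)
score_b = gmm_log_likelihood(frames, speaker_b)
best = "speaker_a" if score_a > score_b else "speaker_b"
```

Because the score is averaged over frames and ignores their order, it does not depend on which words were spoken, which is what makes this kind of model a natural fit for text-independent use.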