Working of voice recognition system

Alexa Please Read…

Thanks to digital voice assistants, which are growing immensely popular and changing the way we search, shop, and access online services, voice recognition technology is on the rise.
Studies indicate that a whopping 50% of online access will shift to voice by 2020, which is just around the corner. I’m sure you will agree, and why wouldn’t you? After all, it’s not hard to find Siri or Alexa hard at work in many of our homes. They understand what is being said, decipher the task expected of them, and act on it promptly too, be it creating shopping lists, booking a table at a restaurant, playing music, reading text back, giving the latest news, or adjusting the lighting in our homes. This is the power of voice recognition technology.

What is this voice recognition technology?

Plainly speaking, voice recognition is an alternative to typing on a keyboard. You talk to a device, and it captures the sound waves in the air and translates them into a digital representation that a computer or smart device can understand.
By measuring the sounds a user makes while speaking, voice recognition software can measure the unique biological factors that, combined, produce the voice. Used this way, to build what is commonly called a voiceprint, voice recognition is the process of identifying and authenticating a speaker.
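
To make "translating sound waves into a digital representation" concrete, here is a minimal sketch of sampling and quantization. The function names, the 200 Hz test tone, the 8 kHz sample rate, and the 8-bit depth are all assumptions picked for illustration; a real device does this step in hardware, in an analog-to-digital converter.

```python
import math

def digitize(signal, duration_s, sample_rate_hz, bit_depth):
    """Sample a continuous signal and quantize each sample to a signed integer."""
    max_level = 2 ** (bit_depth - 1) - 1          # 127 for 8-bit audio
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                    # time of the n-th sample
        amplitude = max(-1.0, min(1.0, signal(t)))  # clip to [-1, 1]
        samples.append(round(amplitude * max_level))
    return samples

def tone(t, freq_hz=200.0):
    """Hypothetical stand-in for a sound wave: a pure 200 Hz tone."""
    return math.sin(2 * math.pi * freq_hz * t)

# 10 ms of "sound" captured at 8 kHz with 8 bits per sample.
pcm = digitize(tone, duration_s=0.01, sample_rate_hz=8000, bit_depth=8)
print(len(pcm), pcm[:6])
```

The list of integers in pcm is the kind of digital representation that the later analysis steps work on.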

How does this work?

It begins by recording samples of a person’s speech and digitizing them to create a unique voiceprint, or template. The speech is broken down into words, their tones, and mannerisms, including specific utterances. Voice is measured here as a behavioral biometric, in the same family as dynamic signature, gait, and keystroke recognition. The measurements are mapped and a unique voiceprint is created.
The voiceprint has two components:

Physiological Component

This, as the name suggests, is the physical form of the voice: the shape of the vocal tract, i.e. the larynx, nose, and mouth. Biometric technology uses the waveform of the voice sample to digitally recreate the shape of a person’s vocal tract. Every vocal tract is different, and thus every voiceprint is unique too.

Behavioral Component

This component covers the movement of the physiological form, such as a person’s jaw, tongue, and larynx. Any variation in movement changes the manner in which a word is uttered, the pace at which it is said, and the way it is pronounced, which includes individual accent, tone, pitch, speed, etc. With this, one can apply the mapped voiceprint to recognize the voice or speaker, and also to decipher speech, that is, the words or content, which is what most digital voice assistants do. But to use it for authentication and identity tracking, as law enforcement agencies and other organizations do for the security and safety of data, one needs to understand how voices are represented and matched for the desired purpose.

Representation and Matching of Voices

Have you noticed a small pin-hole on the back of your phone? It is a microphone used for noise cancellation. It collects background noise and generates inverse sound waves that help cancel noise.
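
The inverse-wave idea can be sketched in a few lines. This is an idealized illustration, not how any particular phone implements it: it assumes the second microphone hears exactly the background noise, perfectly aligned in time with the main microphone, and the function names and tone frequencies are made up for the example.

```python
import math

def sine(freq_hz, n_samples, sample_rate_hz, amplitude=1.0):
    """Generate a pure tone as a list of float samples."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def cancel_noise(primary_mic, noise_mic):
    """Add the inverted noise signal to the primary microphone signal."""
    anti_noise = [-x for x in noise_mic]          # the inverse sound wave
    return [p + a for p, a in zip(primary_mic, anti_noise)]

voice = sine(220, 160, 8000)                      # stand-in for the speaker
hum = sine(50, 160, 8000, amplitude=0.5)          # stand-in for background noise
primary = [v + h for v, h in zip(voice, hum)]     # main mic hears both

cleaned = cancel_noise(primary, hum)
residual = max(abs(c - v) for c, v in zip(cleaned, voice))
print(residual)  # tiny: only floating-point rounding remains
```

In practice the noise reaching the two microphones differs in level and timing, so real systems have to estimate and adapt the anti-noise signal instead of simply flipping the sign.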
Clean audio like this is important for processing and analyzing a voice for speaker identification. Voice recognition must analyze the unique characteristics of each voice and then compare them either to the master voiceprint of a specific enrolled identity, for verification, or across the voice samples in a database, when a stranger is being identified.
Changes in behavioral components, such as a person’s mood, stress, or an illness like a cold or cough, have an impact on the voice; thus two different methods of recognition are used.
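
The two kinds of comparison described above, checking one claimed identity versus searching a whole database, can be sketched as follows. Everything here is an assumption for illustration: voiceprints are reduced to three-number vectors, cosine similarity stands in for whatever matching a real system uses, and the 0.95 threshold is arbitrary.

```python
import math

def cosine_similarity(a, b):
    """Similarity of two feature vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(sample, enrolled_print, threshold=0.95):
    """Verification (1:1): does the sample match one claimed identity?"""
    return cosine_similarity(sample, enrolled_print) >= threshold

def identify(sample, database, threshold=0.95):
    """Identification (1:N): best match in the database, or None."""
    best_name, best_score = None, threshold
    for name, voiceprint in database.items():
        score = cosine_similarity(sample, voiceprint)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical enrolled voiceprints.
database = {"alice": [0.9, 0.1, 0.3], "bob": [0.2, 0.8, 0.5]}
sample = [0.88, 0.12, 0.31]

print(verify(sample, database["alice"]))     # True
print(verify(sample, database["bob"]))       # False
print(identify(sample, database))            # alice
print(identify([0.1, 0.1, 0.9], database))   # None: a stranger to this database
```

The threshold is where mood, stress, or a cold comes in: set it too strict and a hoarse but genuine user is rejected, too loose and an impostor gets through.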

Text-dependent system (constrained manner)

In this system of analysis, an individual has to read or utter a fixed phrase that is pre-programmed in the system. The system then compares the uttered words against the master voiceprint and assigns an accuracy score. This system requires less data and relies on fixed, predetermined utterances. This improves authentication performance, especially if users cooperate and are willing to read and reread.
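
One classic way to score a fixed phrase against a stored template is dynamic time warping (DTW), which tolerates the same phrase being spoken faster or slower. The article does not say which comparison a given product uses, so treat this as a sketch of one option: the one-number-per-frame "features", the score formula, and the function names are all assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    rows, cols = len(a), len(b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[rows][cols]

def passphrase_score(attempt, template):
    """Map DTW distance to a score in (0, 1]; 1.0 means a perfect match."""
    return 1.0 / (1.0 + dtw_distance(attempt, template) / len(template))

template = [1.0, 3.0, 4.0, 2.0]                       # enrolled phrase
slow_attempt = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 2.0, 2.0]  # same phrase, half speed
other_phrase = [4.0, 1.0, 2.0, 3.0]                   # a different utterance

print(passphrase_score(slow_attempt, template))   # 1.0
print(passphrase_score(other_phrase, template))
```

The slower attempt still scores 1.0 because DTW is allowed to stretch time, while the different phrase scores noticeably lower.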

Text-independent system (unconstrained manner)

In this system, the speech input is longer, free, and not determined in advance. The system digitizes the voice as a voiceprint that captures speech mannerisms and more across a wide spectrum, in the same way that gait, signature, and keystroke biometrics capture behavior. This system requires more data, takes longer to process, and can enroll users passively, without requesting specific utterances.
Most call centers use both systems: a text-dependent system better suits an app or website access tool, making it easier and more convenient, while a text-independent system is better for security against fraud due to its increased accuracy.

Analyzing Voices as Biometrics

Voice samples are recognized and matched only after the core process of analysis, which truly determines the outcome. Voice or speech samples are mapped as waveforms, with loudness on the vertical axis and time on the horizontal axis. The speech samples are digitized from their analog format, then the features of the person’s voice are extracted and a voice model is created. These voice models express the underlying variations and temporal changes in a voice over a period of time; they include dynamics like the quality, duration, intensity, and pitch of the vocal signal. The voice models are then used to compare the similarities and differences between the input voice and the stored voice “states” to produce a recognition decision.
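
Three of the dynamics mentioned above (duration, intensity, and pitch) can be pulled out of a digitized waveform with very little code. This is a simplified sketch: the function names are made up, intensity is taken as root-mean-square amplitude, and pitch is estimated by plain autocorrelation over an assumed 50 to 400 Hz range, which works on a clean tone but is far cruder than what production systems use.

```python
import math

def rms_intensity(samples):
    """Root-mean-square amplitude, a simple measure of loudness."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def estimate_pitch_hz(samples, sample_rate_hz, min_hz=50, max_hz=400):
    """Pick the lag at which the waveform best matches a shifted copy of itself."""
    min_lag = sample_rate_hz // max_hz
    max_lag = sample_rate_hz // min_hz
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        corr = sum(samples[i] * samples[i + lag]
                   for i in range(len(samples) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate_hz / best_lag

def extract_features(samples, sample_rate_hz):
    return {
        "duration_s": len(samples) / sample_rate_hz,
        "intensity_rms": rms_intensity(samples),
        "pitch_hz": estimate_pitch_hz(samples, sample_rate_hz),
    }

# Hypothetical input: 0.1 s of a 200 Hz tone sampled at 8 kHz.
SAMPLE_RATE_HZ = 8000
waveform = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE_HZ) for n in range(800)]
features = extract_features(waveform, SAMPLE_RATE_HZ)
print(features)  # pitch_hz comes out at 200.0 for this clean tone
```

A real voice model would compute features like these over many short, overlapping frames, so that the changes over time become part of the model.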

Hidden Markov Models (HMMs), a statistical representation of the sounds produced by a person, are mostly used in text-dependent verification systems. The Gaussian Mixture Model (GMM), a “state-mapping” model closely related to the HMM, is often used for unconstrained, text-independent applications. Newer research in voice recognition technology has introduced the concept of an end-to-end neural speaker recognition system, which works in both text-dependent and text-independent conditions. Here the same system can be used to recognize both the speaker and the speech or text. There is much happening in this space, and the work to curb the disadvantages and find new applications for voice and speech recognition systems is loud and clear.
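
To give a feel for how a Gaussian Mixture Model is used for text-independent matching, here is a minimal scoring sketch. It assumes the models have already been trained, that each speaker's GMM has diagonal covariances, and that each frame of speech has been reduced to two made-up feature values; the speaker names and all numbers are hypothetical.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_log_likelihood(frames, components):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    components: list of (weight, means, variances) tuples.
    frames: list of feature vectors, one per short slice of speech.
    """
    total = 0.0
    for frame in frames:
        comp_logs = [
            math.log(weight) + sum(gaussian_log_pdf(x, m, v)
                                   for x, m, v in zip(frame, means, variances))
            for weight, means, variances in components
        ]
        peak = max(comp_logs)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(l - peak) for l in comp_logs))
    return total / len(frames)

# Hypothetical pre-trained speaker models.
alice_gmm = [(0.6, [120.0, 0.3], [100.0, 0.01]),
             (0.4, [180.0, 0.5], [150.0, 0.02])]
bob_gmm = [(0.5, [210.0, 0.7], [120.0, 0.02]),
           (0.5, [260.0, 0.9], [200.0, 0.03])]

frames = [[125.0, 0.32], [175.0, 0.48], [118.0, 0.29]]  # unknown speaker
scores = {"alice": gmm_log_likelihood(frames, alice_gmm),
          "bob": gmm_log_likelihood(frames, bob_gmm)}
print(max(scores, key=scores.get))  # alice
```

Because the score only asks how well the frames fit each speaker's distribution, not what order they came in, it does not matter which words were spoken, which is what makes this style of model a natural fit for unconstrained speech.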