How Automatic Speech Recognition Works (Infographic)
Back in 2012, Apple ran a series of ads for the iPhone 4S featuring celebrities including director Martin Scorsese and actress Zooey Deschanel carrying on conversations with their mobile devices. Although critics were quick to point out that the iPhone’s Siri interface does not work as seamlessly in real life, the commercials did illustrate what the ideal automatic speech recognition program looks like.
Automatic speech recognition (ASR) is any program that translates spoken language into readable text in order to allow a user to give voice commands. This technology has been around much longer than most people think; it has been researched by the government and military since the 1950s, and it actually became available as a tool to assist individuals with musculoskeletal disabilities back in the 1980s. However, it is only in more recent years that it has become incredibly pervasive in our culture, thanks in large part to the implementation of ASR programs on smartphones.
ASR devices work by translating a user’s words (spoken into the device’s microphone) into a waveform. The waveform is then broken down into phonemes, which are the individual sounds that make up all the words in the language the device is programmed to recognize (by one count, English has 44 phonemes, for example, while French has 33 and Italian has 49; exact totals depend on how the sounds are classified). Each phoneme functions like a link in a chain. The device will identify the first phoneme and then use statistical analysis to determine which phonemes (and on a larger scale, which words) are most likely to follow. This is what allows these devices to respond to queries and commands in real time.
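The "link in a chain" idea can be sketched with a toy bigram model that counts which phoneme tends to follow which. This is only an illustration under stated assumptions: the phoneme labels and the four training words below are made up for the example, and real recognizers use far larger models trained on recorded speech.

```python
from collections import Counter, defaultdict

# Toy training data: words written as phoneme sequences. The ARPAbet-style
# labels and the word list are assumptions chosen for illustration.
TRAINING_SEQUENCES = [
    ["K", "AE", "T"],       # "cat"
    ["K", "AE", "P"],       # "cap"
    ["K", "AE", "T", "S"],  # "cats"
    ["B", "AE", "T"],       # "bat"
]


def train_bigrams(sequences):
    """Count how often each phoneme is followed by each other phoneme."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def next_phoneme_probabilities(counts, current):
    """Convert the follow-up counts for `current` into probabilities."""
    followers = counts.get(current, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # phoneme never seen in training
    return {phoneme: n / total for phoneme, n in followers.items()}


model = train_bigrams(TRAINING_SEQUENCES)
probs = next_phoneme_probabilities(model, "AE")
most_likely = max(probs, key=probs.get)
print(most_likely, probs)
```

In this toy data, "AE" is followed by "T" three times and by "P" once, so the model favors "T" as the next link in the chain.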
There are two primary types of ASR: direct dialog and natural language processing (NLP). Direct dialog is the type of interaction you have when you call a business or customer service number and are asked to speak a command or menu option in order to access a certain recording. For example, you might be able to call your bank, speak or manually enter your account information, and then speak the command “Check my balance” in order to access a recording that will tell you how much money you currently have in your bank account. Direct dialog conversations are close-ended, meaning there are only a certain number of voice commands that the program will recognize and respond to.
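A close-ended command set can be pictured as a simple lookup table. The sketch below assumes the speech has already been transcribed to text; the menu entries, the recording names, and the "reprompt" fallback are hypothetical and not taken from any real phone system.

```python
# Hypothetical menu: recognized phrase -> recording or action to trigger.
MENU = {
    "check my balance": "balance_recording",
    "recent transactions": "transactions_recording",
    "speak to an agent": "agent_transfer",
}


def normalize(transcript):
    """Lowercase the text, drop punctuation, and collapse extra spaces."""
    cleaned = "".join(
        ch for ch in transcript.lower() if ch.isalnum() or ch.isspace()
    )
    return " ".join(cleaned.split())


def handle_command(transcript):
    """Return the menu action for a transcript, or ask the caller to retry."""
    return MENU.get(normalize(transcript), "reprompt")


print(handle_command("Check my balance."))    # a phrase on the menu
print(handle_command("What's the weather?"))  # anything else is rejected
```

Because only the phrases listed in the table are accepted, anything off the menu falls through to the fallback, which is what makes this style of interaction close-ended.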
Natural language processing, on the other hand, is more open-ended. A program may come with a set number of vocabulary words, some of which are “tagged” based on the likelihood that they will be used (for example, “weather forecast” might be tagged because many people ask their phones to look up the local weather). The program will also store data from past interactions, which will help it determine what words and phrases are statistically likely to be used together. This process is referred to as “active learning.”
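As a rough sketch of how stored interactions could inform which words are likely to appear together, the example below counts adjacent word pairs from past requests. The class name, method names, and sample requests are all assumptions made for illustration; production systems use much richer statistical models.

```python
from collections import Counter


class InteractionHistory:
    """Keeps counts of adjacent word pairs seen in past requests."""

    def __init__(self):
        self.pair_counts = Counter()

    def record(self, utterance):
        """Store every adjacent word pair from one transcribed request."""
        words = utterance.lower().split()
        self.pair_counts.update(zip(words, words[1:]))

    def likely_next_words(self, word, top_n=3):
        """Rank the words that have followed `word`, most frequent first."""
        followers = Counter()
        for (first, second), n in self.pair_counts.items():
            if first == word.lower():
                followers[second] += n
        return [w for w, _ in followers.most_common(top_n)]


history = InteractionHistory()
for request in [
    "weather forecast today",
    "weather forecast tomorrow",
    "weather report",
]:
    history.record(request)

suggestions = history.likely_next_words("weather")
print(suggestions)
```

Each new request updates the counts, so a phrase such as “weather forecast” rises in the ranking the more often users say it.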
Of course, ASR technology is far from perfect. The average accuracy of these programs is 96%, but that’s only under “ideal conditions.” Factors such as loud background noises, speech characteristics that do not match the training data (such as accents), and multiple people talking at the same time can all muddle a program’s waveforms and cause the accuracy level to drop. However, ASR is steadily continuing to improve, and in the coming years we’ll likely see devices that are much better at accurately responding in real time, even when conditions aren’t ideal.
To learn more about ASR, check out this informative infographic from West Interactive.