Speech Recognition

From Citizendium
Revision as of 15:27, 9 August 2008 by imported>Nora Apsel (→‎Emerging Technologies)
This article is developed but not approved.
This article is currently being developed as part of an Eduzendium student project. The project's homepage is at CZ:CIS 700 Special Topics 2008. One of the goals of the course is to provide students with insider experience in collaborative educational projects, and so you are warmly invited to join in here, or to leave comments on the discussion page. However, please refrain from removing this notice.



Speech recognition is one of the main elements of natural language processing, or computer speech technology. Speech recognition is equivalent to taking dictation: converting speech into comprehensible data. Humans perform this skill seemingly without effort, but it demands formidable processing and algorithmic resources from computers.


History of Speech Recognition

Writing systems are ancient, going back as far as the Sumerians of 6,000 years ago. The phonograph, which allowed the analog recording and playback of speech, dates to 1877. Speech recognition had to await the development of the computer, however, because recognizing speech poses a host of problems.

First, speech is not simply spoken text--in the same way that Miles Davis playing So What can hardly be captured by a note-for-note rendition as sheet music. What humans understand as discrete words with clear boundaries are actually delivered as a continuous stream of sounds. Iwenttothestoreyesterday, rather than I went to the store yesterday. Words can also blend, with Whaddayawa? representing What do you want?

Second, there is no one-to-one correspondence between sounds and letters. English has only five vowel letters--a, e, i, o, u, and sometimes y--yet more than twenty distinct vowel sounds, and the exact count varies by dialect. The reverse problem also occurs, where more than one letter can represent a given sound: the letter c can have the same sound as the letter k or as the letter s.

In addition, people who speak the same language do not make the same sounds. There are different dialects--the word 'water' could be pronounced watter, wadder, woader, wattah, and so on. Each speaker also has a distinctive pitch: men typically have the lowest pitch, while women and children have higher pitches (though there is wide variation and overlap within each group). Pronunciation is also colored by adjacent sounds, by the speed at which the user is talking, and even by the user's health--consider how pronunciation changes when a person has a cold.

Lastly, consider that not all sounds are meaningful speech. Regular speech is filled with interjections that do not have meaning: Oh, like, you know, well. There are also sounds that are a part of speech that are not considered words: er, um, uh. Coughing, sneezing, laughing, sobbing, even hiccupping can be a part of what is spoken. And the environment adds its own noises; speech recognition is difficult even for humans in noisy places.

Despite the manifold difficulties, speech recognition has been attempted for almost as long as there have been digital computers. As early as 1952, researchers at Bell Labs had developed an Automatic Digit Recognizer, or "Audrey". Audrey attained an accuracy of 97 to 99 percent if the speaker was male, and if the speaker paused 350 milliseconds between words, and if the speaker limited his vocabulary to the digits from one to nine, plus "oh", and if the machine could be adjusted to the speaker's speech profile. Results dipped as low as 60 percent if the recognizer was not adjusted.[1]

Speech Recognition Today

Technology

Speech is derived from unique sounds created by the vocal cords of the human species. Through constant exposure to speech during development, a child "learns" to understand similar-sounding words from different people because of the phonetic similarities between them; the capabilities of the human brain make this remarkable feat possible. So far we have been able to reproduce it in computers only on a limited basis. Current voice recognition technologies work by mathematically analyzing the sound waves formed by our voices, through resonance and spectrum analysis. A computer system first records the sound waves spoken into a microphone and passes them through an analog-to-digital converter. The analog, or continuous, sound wave that we produce when we say a word is sliced into small time fragments, and each fragment is measured by its amplitude level, where amplitude is the level of compression of the air released from a person's mouth. To decide how often to measure the amplitudes when converting a sound wave to digital format, the industry has commonly relied on the Nyquist-Shannon theorem.[2]
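The slicing-and-measuring step described above can be sketched in a few lines. The 10 ms frame length, the 8 kHz sample rate, and the use of root-mean-square (RMS) as the amplitude measure are illustrative choices for this sketch, not a description of any particular product:

```python
import math

def frame_amplitudes(samples, sample_rate, frame_ms=10):
    """Slice a digitized waveform into short time fragments ("frames")
    and measure the RMS amplitude of each one."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    amplitudes = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        amplitudes.append(rms)
    return amplitudes

# One second of a synthetic 440 Hz tone sampled at 8 kHz.
rate = 8000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
print(len(frame_amplitudes(tone, rate)))  # 100 frames of 10 ms each
```

Real recognizers extract richer features per frame (spectra, cepstral coefficients), but the frame-by-frame measurement pattern is the same.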

Nyquist-Shannon Theorem
The Nyquist-Shannon theorem, first formulated by Harry Nyquist in 1928 and later proved by Claude Shannon, shows that an analog signal can be accurately reconstructed from digital samples taken at a rate at least twice the highest frequency in the signal. This is because, as Nyquist showed, each audible cycle must be sampled at least once during its compression phase and once during its rarefaction phase. For example, an audio signal containing frequencies up to 20 kHz can be accurately represented by samples taken at 44.1 kHz.

Interpreting Samples for Voice Recognition

In speech recognition programs, software converts spoken instructions into digital samples. These samples are measured against a stored database of recognized instructions; if a sample matches a stored instruction, the software executes a command. While this concept sounds simple enough, matching a sample to a stored instruction can be very difficult.
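The sampling-rate rule can be illustrated numerically. The helper below is a hypothetical function written for this sketch: it computes the apparent frequency of a sampled pure tone. Frequencies at or below half the sample rate survive intact, while higher ones "fold" down to a lower alias frequency:

```python
def alias_frequency(signal_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling: tones above
    the Nyquist limit (half the sample rate) fold back down."""
    return abs(signal_hz - sample_rate_hz * round(signal_hz / sample_rate_hz))

# A 20 kHz tone sits below the 22.05 kHz Nyquist limit of 44.1 kHz
# sampling, so it is captured faithfully...
print(alias_frequency(20_000, 44_100))  # 20000
# ...but a 30 kHz tone would fold down to a spurious 14.1 kHz.
print(alias_frequency(30_000, 44_100))  # 14100
```

This folding is why recorders filter out frequencies above the Nyquist limit before digitizing.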

Recognizing Commands
The most important goal of current speech recognition software is to recognize commands, which greatly increases the usefulness of speech software. Software such as Sync is built into many new vehicles, supposedly allowing users to access all of the car's electronic accessories hands-free. Such software typically includes a short training phase: it asks the user a series of questions and, based on how the user pronounces some common words, derives constants to factor into its speech recognition algorithms and provide better recognition in the future. Tech reviewers have said the technology is much improved from the early 1990s, but that it will not be replacing hand controls any time soon.[3]
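One classic way early command recognizers matched a spoken sample against stored templates--not necessarily the method Sync uses--is dynamic time warping (DTW), an elastic distance that tolerates differences in speaking rate. The one-dimensional "feature tracks" below are made-up stand-ins for real acoustic features:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences,
    tolerating stretching and compression of either sequence."""
    inf = float("inf")
    # cost[i][j]: best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # stretch b
                                    cost[i][j - 1],      # stretch a
                                    cost[i - 1][j - 1])  # advance both
    return cost[len(a)][len(b)]

def recognize(sample, templates):
    """Return the command whose stored template is closest to the sample."""
    return min(templates, key=lambda cmd: dtw_distance(sample, templates[cmd]))

# Hypothetical feature tracks for two stored commands.
templates = {"call": [1, 3, 5, 3, 1], "radio": [5, 5, 2, 2, 5]}
spoken = [1, 1, 3, 5, 5, 3, 1]   # "call", spoken more slowly
print(recognize(spoken, templates))  # call
```

Modern systems have largely replaced template matching with statistical models, but DTW shows why a slow and a fast utterance of the same command can still be recognized as the same word.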

Business

Major Speech Technology Companies

NICE Systems (NASDAQ: NICE and Tel Aviv: NICE), headquartered in Israel and founded in 1986, specializes in digital recording and archiving technologies. In 2007 the company made $523 million in revenue. For more information visit http://www.nice.com.

Verint Systems Inc. (OTC: VRNT), headquartered in Melville, New York and founded in 1994, describes itself as "A leading provider of actionable intelligence solutions for workforce optimization, IP video, communications interception, and public safety."[4] For more information visit http://verint.com.

Nuance (NASDAQ: NUAN), headquartered in Burlington, Massachusetts, develops speech and image technologies for business and customer service uses. For more information visit http://www.nuance.com/.

Vlingo, headquartered in Cambridge, MA, develops speech recognition technology that interfaces with wireless and mobile devices. Vlingo recently teamed up with Yahoo!, providing the speech recognition technology for Yahoo!'s mobile search service, oneSearch. For more information visit http://vlingo.com.

Patent Infringement Lawsuits

Speech Solutions

The Future of Speech Recognition

Emerging Technologies

Mobile Search Applications

In recent years there has been a steady movement toward developing speech technologies to replace or enhance text input. The trend shows up in products such as audio search engines, voicemail-to-text programs, dictation programs, and desktop "say what you see" commands.[5] Recently both Yahoo! and Microsoft launched voice-based mobile search applications. The concepts behind Yahoo! oneSearch and Microsoft Tellme are very similar; it is their implementations and the speech technology used in their applications that differ. With both products, users speak a search term into their mobile phone or PDA while holding down the green talk button; the request is sent to a server that analyzes the audio clip, and the results of the search then appear on the mobile device.[6] OneSearch is currently available to select Blackberry users in the US and can be downloaded from Yahoo!. Tellme is available to Blackberry users in the US by download from Microsoft and comes pre-installed on Helio's Mysto handsets.

Yahoo! has partnered with vlingo for the speech recognition feature of oneSearch. This voice technology allows users to say exactly what they want; they do not need to know or use special commands, speak slowly, or over-articulate. Vlingo's technology implements Adaptive Hierarchical Language Models (A-HLMs), which allow oneSearch to use regional details and user patterns to adapt to the characteristics of its surroundings, including word pronunciation, accents, and acoustics.[7]

Microsoft's subsidiary Tellme took a different approach to the speech recognition element of its mobile search application. Users are instructed to say specific phrases such as "traffic," "map," or the name of a business. Tellme's senior product manager David Mitby explains why they chose to limit the speech parameters: "[because] very solid smartphone users pick this up for the first time, it’s not clear what to do. It’s not intuitive media yet."[8]

Future Trends & Applications

The Medical Industry
For years the medical industry has been touting electronic medical records (EMRs). Unfortunately, the industry has been slow to adopt them, and some companies are betting that the reason is data entry: there are not enough people to enter the multitude of existing patient data into electronic format, so the paper record prevails. Nuance (also featured in other areas here, and developer of the software called Dragon Dictate) is betting that it can find a market selling its voice recognition software to physicians who would rather dictate patient data than handwrite all medical information into a person's file.[9]

References

  1. K.H. Davis, R. Biddulph, S. Balashek: Automatic recognition of spoken digits. Journal of the Acoustical Society of America. 24, 637-642 (1952)
  2. Jurafsky, D. and Martin, J. An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. New Jersey: Prentice Hall 2006
  3. http://etech.eweek.com/content/enterprise_applications/recognizing_speech_recognition.html
  4. see "About Verint"
  5. Microsoft provides this feature on Vista operating systems. To learn more go to http://www.microsoft.com/enable/products/windowsvista/speech.aspx
  6. S.H. Wildstrom, “LOOK MA, NO-HANDS SEARCH; Voice-based, mobile search from Microsoft and Yahoo! are imperfect but promising,” BusinessWeek, McGraw-Hill, Inc. vol. 4088, June 16, 2008.
  7. E. Keleher and B. Monaghan, “Vlingo Introduces Localized Voice-Recognition Support for Yahoo! oneSearch,” vlingo, June 17, 2008, http://vlingo.com/pdf/vlingo%20Introduces%20Localized%20Voice-rec%20for%20Yahoo%20oneSearch%20FINAL.pdf, accessed: August 8, 2008.
  8. R. Joe, “Multiple-Modality Disorder,” Speech Technology, vol. 13, no. 5, June 2008.
  9. http://www.1450.com/speech_enable_emr.pdf