By Steve Welker
There’s a scene in “Star Trek IV: The Voyage Home” in which Cmdr. Montgomery Scott, chief engineer of the Enterprise, visits a company named Plexicorp to buy “transparent aluminum.”
Its president, Dr. Nichols, has never heard of transparent aluminum, so Scotty offers to give him the formula. Dr. McCoy objects, warning that this might change the future. Scotty replies, “How do we know he didn’t invent the stuff?”
Scotty then leans in to Dr. Nichols’ Apple Macintosh Plus and politely says, “Computer?”
McCoy hands Scotty the mouse.
Scotty, speaking into the mouse, says, “Hello, computer.”
Dr. Nichols: “Just use the keyboard.”
Scotty: “A keyboard. How quaint.”
Well, reality is catching up with science fiction. Very quickly, talking to computers will become commonplace.
Engineers and scientists have spent years working with experts in voice and language to develop speech-recognition systems. The ultimate goal is to combine hardware and software and eliminate keyboards. You simply speak to the computer and it turns your words into text. Or you give it an order, “Computer, call home!” and your PC, personal digital assistant or “smart” cell phone consults its memory and connects to your home telephone or household computer network.
Such software exists today, but it’s nowhere near replacing a keyboard. Speech-to-text software interprets language at a speed close to normal conversation, but the error rate is around two words out of every 100 and the only easy way to make corrections is with a keyboard.
A big problem is familiar to every Southerner trying to talk with a New Yorker. Differences in dialect, pronunciation, accent and grammar can make it tough for humans to communicate. It’s even harder to train a dumb computer to recognize different voices. That’s why speech-to-text software requires users to read a few standard pages and “teach” the computer how to interpret its master’s voice.
Let me give you an example. My 16-year-old son, whose lifelong friend is a girl from Minnesota, habitually says “hoose” as she does instead of “house” as I do. I have no trouble understanding him, because I hear the word in a context, such as, “Can I go over to Elizabeth’s hoose?” But without a huge digital memory of vocabulary, variations in pronunciation, sample sentences and contextual examples, a computer’s speech-recognition system will, at best, spell out “hoose.” At worst, it won’t recognize the word at all, forcing someone to type it in.
The more you delve into this, the more you realize the fiendish difficulty of speech recognition. One of my mentors, the late Robert Ross, founded Voice Response Inc. in the early 1970s to develop speech recognition for companies that handle high volumes of telephone calls. It took 10 years for VRI to study, understand and solve the problem of pauses and punctuation. When you stop speaking, how does a computer “know” whether you have paused to gather your thoughts or to end a sentence? To Bob’s credit, his passion and determination led to significant advances in voice recognition, even though VRI failed as a commercial venture.
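To see why pauses are so tricky, here is a minimal sketch of the most naive approach: treat any silence longer than a fixed length as the end of a sentence. This is only an illustration, not VRI’s method. The function name, the input format (each word paired with the seconds of silence that follow it) and the 0.7-second threshold are all assumptions made up for the example.

```python
def punctuate(words_with_pauses, sentence_pause=0.7):
    """Join words into sentences, ending one wherever a long pause follows.

    words_with_pauses: list of (word, seconds_of_silence_after_word) pairs.
    sentence_pause: assumed threshold in seconds; an arbitrary guess.
    """
    sentences = []
    current = []
    for word, pause in words_with_pauses:
        current.append(word)
        if pause >= sentence_pause:
            sentences.append(_finish(current))
            current = []
    if current:
        # Whatever is left when the speaker stops is treated as a sentence.
        sentences.append(_finish(current))
    return " ".join(sentences)


def _finish(words):
    """Capitalize the first letter and add a period."""
    text = " ".join(words)
    return text[0].upper() + text[1:] + "."
```

A speaker who stops for a second in mid-sentence to gather a thought would get a period in the wrong place, which is exactly the problem described above. A fixed threshold cannot tell the two kinds of pause apart.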
But that was before the birth of the World Wide Web.
Consider what people do as they surf the Web. Most don’t use a keyboard very much. They might type a few words into a search engine. They click buttons and links. They scroll up and down a page. A small vocabulary describes all those things. It’s not very hard to put the key words and actions in a speech-recognition program.
Or think about using a bank’s ATM. Again, your actions are limited to a relatively small number of choices described by a few words: “Deposit. Withdrawal. Inquiry. Other.” The words don’t sound alike; variations between a Southern drawl and a Yankee burr aren’t likely to prevent the computer from distinguishing between “checking” and “savings.”
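The small-vocabulary idea can be sketched in a few lines of Python. This assumes a recognizer has already turned the sound into a rough text guess; the word list comes from the ATM example, while the function name and the 0.6 similarity cutoff are hypothetical choices. It uses the standard library’s `difflib` to pick the closest word on the menu, so a slightly mangled guess still lands on the right command.

```python
from difflib import get_close_matches

# Assumed ATM vocabulary, taken from the menu words in the example above.
COMMANDS = ["deposit", "withdrawal", "inquiry", "other", "checking", "savings"]


def match_command(heard, vocabulary=COMMANDS, cutoff=0.6):
    """Return the vocabulary word closest to what was heard, or None.

    heard: the recognizer's rough text guess (an assumed input).
    cutoff: minimum similarity from 0 to 1; 0.6 is an arbitrary choice.
    """
    guess = heard.strip().lower()
    candidates = get_close_matches(guess, vocabulary, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None
```

Because the menu holds only six words that look and sound quite different, even an imperfect guess such as “withdrawl” or “checkin” still matches its intended command, while unrelated input returns `None` and the machine can ask again.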
Seeing the value of such a simplified approach, the World Wide Web Consortium (W3C) has developed a set of standards for interactive voice response programs on the Internet. Version 2.1 of VXML (Voice Extensible Markup Language) was moved up to “Recommendation” status a few weeks ago. As often happens, there’s also a competing standard: SALT (Speech Application Language Tags), backed by Microsoft. Each has its good features; I expect both to be incorporated into the draft plan for VXML 3.0.
I can’t overstate the impact these standards will have. The World Wide Web grew so fast in part because W3C set standards for creating Web pages. In general, it doesn’t matter what computer or browser you use to create or read Web pages, see digitized photos and video or listen to digital sound. With VXML in place, I expect to see an explosion in voice-based applications for both computers and telephones.
We can look forward to the day when, if a piece of software crashes, the computer will hear us cussing and understand what we mean ... in the future.