Who's Talking Now?
Captain Kirk couldn't type. He didn't need to. In his fictional world, the computer was smart enough to do that laborious task for him. He just told it what he needed, and in a flash, the computer spoke the answer. In comparison, the current interface with the computer-riddled world is primitive. Keyboards act as the interface to convert thoughts into a form the computer can understand.
Think about the computer sitting on your desk right now. How much of it is really computer and how much of it is input or output devices? The keyboard, monitor, printer, fax machine and speakers are all things you need to communicate with a computer. Even the laptop is mostly screen and keyboard. Going mobile with data is not going to be easy as long as you have to carry along all of the baggage. Some have said that the perfect wireless device will have a large screen, a full keyboard and will fit nicely in a small pocket. That just isn't going to happen. People don't need a better computer; they need a better way to interface with it.
COMPUTER INTERACTION You can talk to your computer today. Software programs such as those from IBM and Dragon allow you to dictate to your computer, and the words you speak magically appear on the screen. The technology is not perfect. Some words are converted incorrectly, formatting can be incorrect, and the programs take a huge processing byte out of your computer. But it can be done.
Having your computer talk to you is more complex and is an important part of effortless mobile computing. E-mail, for example, mostly is text. Using sophisticated services such as Portico, Webley or Wildfire, you can have your e-mail read to you and then respond.
However, even with these services, there are still hurdles to overcome. Portico "listens" to every sound you make. If you say "um," it thinks you mean it. It also has a bad habit of responding to background noise. Using Portico in a noisy environment is nearly impossible, and finding a quiet environment is not always an option. Although the voice is robotic and unnatural, it is understandable, and applications for this technology are huge.
While traveling, users can have e-mail read to them via any phone. Users can send responses via an attached media file. They also can filter the e-mail so they receive only the most important messages wirelessly. Although the service offers more features, such as a news-clipping service and the ability to keep an appointment calendar, right now, access to e-mail is the No. 1 application for such services.
General Magic offers a free version of Portico, an e-mail service called MyTalk. Although it doesn't offer all of the bells and whistles of the pay service, the free service does let users check e-mail from any phone and respond by voice as well. Users even can make free long-distance calls via the service. The catch? The calls can last only two minutes, and users have to listen to an advertisement first. The MyTalk Web site also is advertiser-supported. Perhaps, like many Web-based services, the future virtual assistant will be a free service supported by advertising.
GOING VERTICAL E-mail access, however, is only the tip of the iceberg. There are specialized vertical applications for this technology as well.
Conita plans to apply text-to-speech technology to several vertical markets. Its products include V-Enterprise, V-Medical, V-Financial, V-Legal and V-Insurance. Each offers the standard virtual assistant capability that competitors traditionally have provided, but these also allow access to specialized databases and internal information via voice.
Consider the effect this technology could have on these vertical markets. Doctors could use Conita's service to access medical databases, check patient status, schedule procedures and keep up with all incoming voice and e-mail information. Attorneys could use Conita to access case law, access the vast Lexis legal library and schedule court dates. Insurance adjusters could meet the demands of a remote claims adjustment.
The technology also could be powerful for corporate users. V-Enterprise integrates with existing systems. Users could access a shared calendar remotely, access company databases to check on order status or place an order over the phone. They also could access sales applications that provide access to customer status, preferences, order history, contact information and reminders of important events.
Services such as those offered by Portico and Conita are part of the larger trend of unified messaging, which has huge potential. Ovum recently estimated the fixed/mobile market at $2 billion and predicted that it will soar to $35 billion by 2005. A big part of this trend is the migration of more traffic to mobile networks and customers' demands for a single place to check for all message types. Text-to-speech and voice-command technology will be an important piece of the puzzle to provide that functionality to end-users.
To the wireless industry, this trend means more airtime usage. As e-mail and Internet access become more embedded in the mainstream for business and personal users, voice-based access to those services will open the door to millions of users who otherwise would not attempt to tackle the learning curve. Not everyone is comfortable with a smart phone or wireless PalmPilot. Like the good captain, they just want to be able to tell the computer what they want and let the machine do the hard part. Given the choice, who wouldn't?
Better databases, less-expensive computer memory and more processing power have enabled linguists and phoneticists to implement more advanced solutions than possible with traditional text-to-speech (TTS) technology. Developers now use next-generation speech engines to create voice interfaces that lay the foundation for new applications such as e-mail and Web readers. These speech engines generate words by phonetic rules, so vocabularies are unlimited. The achievement of a truly natural-sounding human voice already is making current TTS applications much more compelling, but the future of the voice interface hinges on the computer's ability to interact with the user like a human would.
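The idea that rules, not a word list, give a speech engine an unlimited vocabulary can be sketched in a few lines of Python. This is only an illustration: the RULES table and the to_phonemes function are hypothetical, the phoneme labels are loosely ARPAbet-style, and a real engine would use far more rules plus context and exceptions.

```python
# Toy letter-to-sound rules (hypothetical, for illustration only).
# Keys are spellings; values are loosely ARPAbet-style phoneme labels.
RULES = {
    "sh": "SH", "ch": "CH", "th": "TH", "ee": "IY", "oo": "UW",
    "b": "B", "d": "D", "e": "EH", "h": "HH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "s": "S", "t": "T", "z": "Z",
}


def to_phonemes(word):
    """Convert a word to phoneme labels by longest-match rule lookup."""
    word = word.lower()
    longest = max(len(spelling) for spelling in RULES)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest spelling first so "sh" wins over "s" + "h".
        for length in range(longest, 0, -1):
            chunk = word[i:i + length]
            if chunk in RULES:
                phonemes.append(RULES[chunk])
                i += len(chunk)
                break
        else:
            i += 1  # no rule for this character (e.g. punctuation): skip it
    return phonemes
```

Because the function never consults a dictionary of whole words, it handles an invented word such as "sheeth" as readily as "sheep", which it converts to SH, IY, P.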
Computers must be able to generate questions to clarify what they've heard just as humans do. Until recently, all computer responses have been pre-recorded, which solved the problem of a realistic voice interface. But it also restricted the computer to answering only the questions the developer anticipated. The newest synthesizers enable the computer to generate any follow-up question because automatic speech recognition (ASR) is evolving as well. Next-generation ASR uses natural-language understanding, an artificial-intelligence-based technology, both to recognize words and to understand their context.
There are two main TTS technologies: formant and concatenative synthesis. Formant synthesis models speech on the way humans produce sound using their lungs and vocal cords. Concatenative systems use chips to store segments of recorded human speech in the form of phonemes, diphones and triphones, which are fragments and combinations of the smallest units of speech that distinguish one utterance from another.
Developers have realized that the larger the speech segments they use, the more natural the voice sounds. However, more memory is needed to store and access these segments.
The new concatenative speech synthesizers join larger segments, such as syllables, words and phrases, with several hundred thousand possible segment combinations for each unit. The challenge is to achieve the highest-quality speech with the smallest database and the least amount of processing. The computer must be able to find the best segments quickly and then glue them together in such a way that end-users don't hear the concatenation points.
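One common way to frame the search for the best segments is as a shortest-path problem solved with dynamic programming. The Python sketch below is a simplified illustration under stated assumptions, not any vendor's method: the Segment fields, the pitch-only join_cost, the select_segments function and the two-syllable DATABASE are all hypothetical, and a production system would weigh many more acoustic features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    unit: str           # the unit this recording realizes, e.g. a syllable
    start_pitch: float  # pitch in Hz at the segment's left edge
    end_pitch: float    # pitch in Hz at the segment's right edge


def join_cost(left, right):
    """Penalty for an audible seam: here, just the pitch jump at the boundary."""
    return abs(left.end_pitch - right.start_pitch)


def select_segments(units, database):
    """Pick one recording per unit so that the summed join cost is minimal."""
    if not units:
        return [], 0.0
    candidates = [database[u] for u in units]
    # best[i][j] = (cheapest total cost of a path ending in candidate j
    #               of unit i, index of the previous candidate on that path)
    best = [[(0.0, None) for _ in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for seg in candidates[i]:
            options = [
                (best[i - 1][k][0] + join_cost(prev, seg), k)
                for k, prev in enumerate(candidates[i - 1])
            ]
            row.append(min(options))
        best.append(row)
    # Start from the cheapest final candidate and follow the back-pointers.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Hypothetical two-syllable database with two recordings of each syllable.
DATABASE = {
    "hel": [Segment("hel", 110.0, 120.0), Segment("hel", 100.0, 140.0)],
    "lo": [Segment("lo", 118.0, 90.0), Segment("lo", 150.0, 95.0)],
}
```

Calling select_segments(["hel", "lo"], DATABASE) chooses the first recording of each syllable, because their pitches differ by only 2 Hz at the seam. The work grows with the number of candidates per unit, which is why a smaller database and a cheap cost function matter for speed.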
© 2008 Penton Media Inc.












