this space intentionally left blank

December 29, 2010

Filed under: tech»i_slash_o

Keyboard. How quaint.

Obligatory Scotty reference aside, voice recognition has come a long way, and it's becoming more common: just in my apartment, there's a Windows 7 laptop, my Android phone, and the Kinect, each of which boasts some variation on it. That's impressive, and helpful from an accessibility standpoint--not everyone can comfortably use a keyboard and mouse. Speaking personally, though, I'm finding that I use it very differently on each device. As a result, I suspect that voice control is going to end up like video calling--a marker of "futureness" that we're glad to have in theory, but rarely leverage in practice.

I tried using Windows voice recognition when I had a particularly bad case of tendonitis last year. It's surprisingly good for what it's trying to do, which is to provide a voice-control system to a traditional desktop operating system. It recognizes well, has a decent set of text-correction commands, and two helpful navigation shortcuts: Show Numbers, which overlays each clickable object with a numerical ID for fast access, and Mouse Grid, which lets you interact with arbitrary targets using a system right out of Blade Runner.

That said, I couldn't stick with it, and I haven't really activated it since. The problem was not so much the voice recognition quality, which was excellent, but rather the underlying UI. Windows is not designed to be used by voice commands (understandably). No matter how good the recognition, every time it made a mistake or asked me to repeat myself, my hands itched to grab the keyboard and mouse.

The system also (and this is very frustrating, given the extensive accessibility features built into Windows) has a hard time with applications built around non-standard GUI frameworks, like Firefox or Zune--in fact, just running Firefox seems to throw a big monkey wrench into the whole thing, which is impractical if you depend on it as much as I do. I'm happy that Windows ships with speech recognition, especially for people with limited dexterity, but I'll probably never have the patience to use it even semi-exclusively.

On the other side of the spectrum is Android, where voice recognition is much more limited--you can dictate text, or use a few specific keywords (map of, navigate to, send text, call), but there's no attempt to voice-enable the entire OS. The recognition is also done remotely, on Google's servers, so it takes a little longer to work and requires a data connection. That said, I find myself using the phone's voice commands all the time--much more than I thought I would when the feature was first announced for Android 2.2. Part of the difference, I think, is that input on a touchscreen feels nowhere near as speedy as a physical keyboard--there's a lot of cognitive overhead to it that I don't have when I'm touch-typing--and the expectations of accuracy are much lower. Voice commands also fit my smartphone usage pattern: answer a quick query, then go away.

Almost exactly between these two is the Kinect. It's got on-device voice recognition that no doubt is based on the Windows speech codebase, so the accuracy's usually high, and like Android it mainly uses voice to augment a limited UI scheme, so the commands tend to be more reliable. When voice is available, it's pretty great--arguably better than the gesture control system, which is prone to misfires (I can't use it to listen to music while folding laundry because, like the Heart of Gold sub-etha radio, it interprets inadvertent movements as "next track" swipes). Unfortunately, Kinect voice commands are only available in a few places (commands for Netflix, for example, are still notably absent), and a voice system that you can't use everywhere is a system that doesn't get used. No doubt future updates will address this, but right now the experience is kind of disappointing.

Despite its obvious flaws, the idea of total voice control has a certain pull. Part of it, probably, is the fact that we're creatures of communication by nature: it seems natural to use our built-in language toolkit with machines instead of having to learn abstractions like keyboards or mouse, or even touch. There may be a touch of the Frankenstein to it as well--being able to converse with a computer would feel like A.I., even if it were a lot more limited. But the more I actually use voice recognition systems, the more I think this is a case of not knowing what we really want. Language is ambiguous by its nature, and computers are already scary and unpredictable for a lot of people. Simple commands for a direct result are helpful. Beyond that, it's a novelty, and one that quickly wears out its welcome.

Past - Present