On Audio Interfaces
Over the last few months, and especially since the launch of the iPhone 4s and Siri, I’ve been seeing more and more products pitching and building interfaces that live almost wholly in the audio domain — that is, voice-controlled experiences.
The logic goes something like this: we talk all the time — it’s our natural mode of interaction with people — so our interactions with our phones, computers, and the cloud will be better and more natural if they’re controlled by our own voices instead of the weird invented interactions we use now (historically keyboards, more recently pushing around pictures under glass — also an outstanding essay, btw).
But I think there are many real problems here, and so I’m skeptical.
There’s the problem of speech recognition, which is real, but which I think will be solved over time, so it doesn’t really bug me.
I think more crucial is a concept that Don Norman identified probably 20 years ago, which is this: in interface design sometimes you rely on knowledge in the head, and sometimes you rely on knowledge in the world.
Knowledge in the head is what we require users to know — to keep in mind without seeing any tips or pointers. It sounds bad, but it really isn’t. Command line interfaces are like this — and while they’re not always very learnable, they’re often very, very powerful. Knowledge in the head is most useful when the space of possibilities is huge, when there are many more tasks than you can display easily. But obviously there’s a problem: you have to remember things.
Knowledge in the world is visible UI elements. Signs, lists, buttons. The obvious advantage here is that you can see the things that are possible, and choose them. The problems are also obvious: real estate in your vision is significantly more limited than it is in your imagination.
Voice interfaces suffer from a combination of problems. The first is that they rely almost entirely on knowledge in the head. That by itself isn’t such a problem. In fact, if our computers (of all sorts, including our phones) could reliably understand not only the words but the concepts of our voice commands, the knowledge-in-the-head problem would fade away, and conversation would take over.
The problem is that the state of Natural Language Processing — NLP — isn’t good enough today to reliably interpret what we want. It’s still very limited in the types of things it can really understand.
So what that means is that we’ve got a system that is very limited (compared to human conversation) in what it can understand, and an interface without any visual cues (knowledge in the world) to help us figure out which commands are viable. With something like Siri, in a lot of ways you’re left with a system that functions much like a command line UI: good at understanding a specific set of queries, but not much good beyond that.
Siri is a big step forward, for sure. It understands more of what you’re talking about than previous systems did, and it can ask you for clarification when it doesn’t.
But in practice, I’m having trouble using it because I don’t have a good map in my head of what it can and can’t do, so it becomes a highly unreliable experience for me. All I really use it for is setting my alarm clock when I’m traveling, because that’s a command I can reliably remember and invoke.
So until our systems get better at understanding what we’re saying, what we get with audio UIs is a space of invisible commands that are only discoverable through time-consuming experimentation (and watching Apple commercials!) — because we’ve got a UI that advertises that no manual is needed, and that’s not really true right now.
All of that put together means I’m a little skeptical at the moment. Hopeful, though, because it would make a lot of things much easier.