Voice and the uncanny valley of AI
First, voice is a big deal because voice input now works in a way that it did not until very recently. The advances in machine learning in the past couple of years mean (to simplify hugely) that computers are getting much better at recognizing what people are saying. Technically, there are two different fields here: voice recognition and natural language processing. Voice recognition is the transcribing of audio to text, and natural language processing is taking that text and working out what command might be in it. Since 2012, error rates for these tasks have gone from perhaps a third to under 5%. In other words, this works, mostly, when in the past it didn’t. This isn’t perfect yet – with normal use a 5% error rate can be something you run into every day or two, and Twitter is full of people posting examples of voice assistants not understanding at all. But this is continuing to improve – we know how to do this now.
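(For the technically minded, a rough sketch of that two-stage split is below. Everything in it is a placeholder: the class and method names are invented for illustration, not any vendor’s actual API.)

```python
# Illustrative two-stage voice pipeline: speech recognition turns audio
# into text, then natural language processing turns text into a command.
# The classes are hypothetical stand-ins, not a real product's API.

from dataclasses import dataclass, field


@dataclass
class Intent:
    name: str                                  # e.g. "set_timer"
    slots: dict = field(default_factory=dict)  # e.g. {"duration_seconds": 300}


class SpeechRecognizer:
    """Stage 1: audio waveform -> text transcript."""
    def transcribe(self, audio: bytes) -> str:
        raise NotImplementedError("plug an acoustic/language model in here")


class IntentParser:
    """Stage 2: text transcript -> structured command."""
    def parse(self, text: str) -> Intent:
        raise NotImplementedError("plug an NLU model in here")


def handle_utterance(audio: bytes, asr: SpeechRecognizer, nlu: IntentParser) -> Intent:
    text = asr.transcribe(audio)   # "set a timer for five minutes"
    return nlu.parse(text)         # Intent("set_timer", {"duration_seconds": 300})
```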
Second, the smartphone supply chain means that making a box with microphones, a fast-enough CPU and a wireless chip is much easier – with 1.5bn smartphones sold last year, there’s a firehose of ever-better, ever-cheaper components of every kind being created for that market at massive scale but available for anything else. In parallel, the ecosystem of experts and contract manufacturers around smartphones and consumer electronics that is broadly centred on Shenzhen means not only that you can get the parts but also that you can get someone to put them together for you. Hardware is still hard, but it’s not as hard as it was. So, if you want a magic voice box that you plan to light up from the cloud, you can make one.
Third, the major internet platform companies collectively (Google, Apple, Facebook and Amazon, or GAFA) have perhaps 10 times the revenue that Wintel had in the 1990s, when they were the companies changing the world and terrifying the small fry. So, there’s a lot more money (and people, and distribution) available for interesting side projects.
Fourth, a smartphone is not a neutral platform in the way that the desktop web browser (mostly) was – Apple and Google have control over what is possible on the mobile internet in ways that Microsoft did not over the desktop internet. This makes internet companies nervous – it makes Google nervous of Apple (and this is one reason why it bought Android), and Amazon and Facebook nervous of both. They want their own consumer platforms, but don’t have them. This is a significant driver behind the Kindle Fire, Alexa, Facebook Messenger bots and all sorts of other projects.
All of this adds up to motive and opportunity. However, this doesn’t necessarily mean that voice ‘works’ – or rather, we need to be a lot more specific about what ‘works’ means.
So, when I said that voice input ‘works’, what this means is that you can now use an audio wave-form to fill in a dialogue box – you can turn sound into text and text (from audio or, of course, from chatbots, which were last year’s Next Big Thing) into a structured query, and you can work out where to send that query. The problem is that you might not actually have anywhere to send it. You can use voice to fill in a dialogue box, but the dialogue box has to exist – you need to have built it first. You have to build a flight-booking system, and a restaurant booking system, and a scheduling system, and a concert booking system – and anything else a user might want to do, before you can connect voice to them. Otherwise, if the user asks for any of those, you will accurately turn their voice into text, but not be able to do anything with it – all you have is a transcription system. And hence the problem – how many of these queries can you build? How many do you need? Can you just dump them to a web search or do you need (much) more?
Machine learning (simplifying hugely) means that we use data at massive scale to generate models for understanding speech and natural language, instead of the old technique of trying to write speech and language rules by hand. But we have no corresponding way to use data to build all the queries that you want to connect to – all the dialogue boxes. You still have to do that by hand. You’ve used machine learning to make a front-end to an expert system, but the expert system is still a pre-data, hand-crafted model. And though you might be able to use APIs and a developer ecosystem to get from answering 0.1% of possible questions to answering 1% (rhetorically speaking), that’s still a 99% error rate. This does not scale – fundamentally, you can’t create answers to all possible questions that any human might ever ask by hand, and we have no way to do it by machine. If we did, we would have general AI, pretty much by definition, and that’s decades away.
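To make the shape of that problem concrete, here is a minimal sketch. The intent names and handlers are invented for illustration; the only point is that every entry in the table has to be built by hand.

```python
# The ML front-end gets you from audio to a recognized intent, but each
# intent still needs a hand-built back-end. Anything outside this dict is
# transcribed perfectly and then answered with a shrug.

def book_flight(slots: dict) -> str:
    return f"Looking for flights to {slots.get('destination', 'somewhere')}..."

def set_timer(slots: dict) -> str:
    return f"Timer set for {slots.get('duration_seconds', 0)} seconds."

# The hand-crafted 'expert system': every entry here was written by a person.
HANDLERS = {
    "book_flight": book_flight,
    "set_timer": set_timer,
    # ...a few hundred more if you have the developers, but never 'anything'.
}

def respond(intent_name: str, slots: dict, transcript: str) -> str:
    handler = HANDLERS.get(intent_name)
    if handler is None:
        # Accurate transcription, nowhere to send it: the computerized shrug.
        return f"Sorry, I heard '{transcript}' but I can't help with that."
    return handler(slots)

print(respond("set_timer", {"duration_seconds": 300}, "set a timer for five minutes"))
print(respond("recommend_wine", {}, "what wine goes with halibut"))
```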
In other words, the trap that some voice UIs fall into is that you pretend users are talking to HAL 9000 when actually you’ve just built a better IVR, and have no idea how to get from the IVR to HAL.
Given that you cannot answer any question, there is a second scaling problem – does the user know what they can ask? I suspect that the ideal number of functions for a voice UI actually follows a U-shaped curve: one command is great and ten is probably OK, but 50 or 100 is terrible, because you still can’t ask just anything yet can no longer remember what you can ask. The other end of the curve comes as you get closer and closer to a system that really can answer anything, but, again, that would be ‘general AI’.
The interesting implication here is that though with enough money and enough developers you might be able to build a system that can answer hundreds or thousands of different queries, this could actually be counterproductive.
The counter-argument to this is that some big platform companies (i.e. Google, Amazon and perhaps Facebook) already have a huge volume of people typing natural language queries in as search requests. Today they answer these by returning a page of search results, but they can take the head of that curve and build structured responses for (say) the top 100 or 500 most common types of request – this is Google’s knowledge graph. So it’s not that the user has to know which 50 things they can ask, but that for the top 50 (or 500) types of question they’ll now get a much better response than just a page of links. Obviously, this can work well on a screen but fails on an audio-only device. But more broadly, how well this works in practice is a distribution problem – it may be that half of all questions asked fall into the top 500 types that Google (say) has built a structured response to, but how many of the questions that I myself ask Google Home each day will be in that top 500, and how often will I get a shrug?
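As a purely illustrative back-of-the-envelope calculation (assuming, just for the sake of argument, that query types follow a Zipf-like distribution; the numbers below are made up, not measured):

```python
# If query *types* followed a Zipf-like distribution (an assumption for
# illustration, not a measurement), what share of query volume would the
# top-N hand-built structured responses cover?

def zipf_coverage(top_n: int, total_types: int = 100_000, s: float = 1.0) -> float:
    weights = [1 / rank ** s for rank in range(1, total_types + 1)]
    return sum(weights[:top_n]) / sum(weights)

for n in (50, 500, 5000):
    print(f"top {n:>5} query types ≈ {zipf_coverage(n):.0%} of query volume")

# With these made-up parameters the top 500 types cover a bit over half of
# the volume, which still leaves plenty of shrugs for any individual user
# whose questions sit in the long tail.
```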
This tends to point to the conclusion that for most companies, for voice to work really well you need a narrow and predictable domain. You need to know what the user might ask and the user needs to know what they can ask. This was the structural problem with Siri – no matter how well the voice recognition part worked, there were still only 20 things that you could ask, yet Apple managed to give people the impression that you could ask anything, so you were bound to ask something that wasn’t on the list and get a computerized shrug. Conversely, Amazon’s Alexa seems to have done a much better job at communicating what you can and cannot ask. Other narrow domains (hotel rooms, music, maps) also seem to work well, again, because you know what you can ask. You have to pick a field where it doesn’t matter that you can’t scale.
Meanwhile, voice is not necessarily the right UI for some tasks even if we actually did have HAL 9000, and all of these scaling problems were solved. Asking even an actual human being to rebook your flight or book a hotel over the phone is the wrong UI. You want to see the options. Buying clothes over an IVR would also be a pretty bad experience. So, perhaps one problem with voice is not just that the AI part isn’t good enough yet but that even human voice is too limited. You can solve some of this by adding a screen, as is rumored for the Amazon Echo – but then, you could also add a touch screen, and some icons for different services. You could call it a ‘Graphical User Interface’, perhaps, and make the voice part optional…
As I circle around this question of awareness, it seems to me that it’s useful to compare Alexa with the Apple Watch. Neither of them does anything that you couldn’t do on your phone, but they move it to a different context and they do it with less friction – so long as you remember. It’s less friction to, say, set a timer or do a weight conversion with Alexa or a smart watch as you stand in the kitchen, but more friction to remember that you can do it. You have to make a change in your mental model of how you’d achieve something, and that something is a simple, almost reflexive task where you already have the muscle memory to pull out your phone, so can this new device break the habit and form a new one? Once the habit or the awareness is there, then for some things a voice assistant or a watch (or a voice assistant on a watch, of course) is much better than pulling out your phone, but the habit does somehow have to be created first.
By extension, there may be a set of behaviors that fit better with a voice UI not because they’re easier to build or because the command is statistically more likely to be used but because the mental model works better – turning on lights, music (a key use case for the Echo) or a timer more than handling appointments, perhaps. That is, a device that does one thing and has one command may be the best fit for voice even though it’s theoretically completely open-ended.
There’s a set of contradictions here, I think. Voice UIs look, conceptually, like much more unrestricted and general purpose interfaces than a smartphone, but they’re actually narrower and more single-purpose. They look like less friction than pulling out your phone, unlocking it, loading an app and so on, and they are – but only if you’ve shifted your mental model. They look like the future beyond smartphones, but in their (necessarily) closed, locked-down nature they also look a lot like feature phones or carrier decks. And they’re a platform, but one that might get worse the bigger the developer ecosystem. This is captured pretty well by the ‘uncanny valley’ concept from computer animation: as a rendering of a person goes from ‘cartoon’ to ‘real person’ there’s a point where increased realism makes it look less rather than more real – making the tech better produces a worse user experience at first.
All of this takes me back to my opening point – that there are a set of reasons why people want voice to be the new thing. One more that I didn’t mention is that, now that mobile is no longer the hyper-growth sector, the tech industry is casting around looking for the Next Big Thing. I suspect that voice is indeed a big thing, but that we’ll have to wait a bit longer for the next platform shift.