How I would put voice control in everything
20.18, Tuesday 26 May 2020
Why can’t I point at a lamp and say “on” and the light come on? Or point at my stove and say “5 minutes”? Or just look at it and talk, if my hands are full.
I speculated about voice-controlled lightbulbs and embedded machine learning on stage last year at Google’s People & AI Research symposium (there’s a link to a video on that page) and was reminded about it the other day when George Buckenham tweeted: “as someone who already owns an Alexa, I would buy a device that doesn’t do any cloud processing, but does allow you to set kitchen timers with your voice and play songs from Spotify”
– which is basically all I do with Siri, and this is kinda what I want too…
…only not a single device, I want voice control in everything, but individually. And really, really basic.
Because it is really appealing to me to turn on a light, set the stove timer, play music, pause the TV, snooze an alarm etc just by saying something. What’s not cool is:
- having a device in my home that harvests every sound in the house and sends it to cloud servers for eternal recording, or not, who knows, and that’s the point – an audio panopticon dressed in plastic
- needing to remember arcane vocal syntaxes
- latency.
And all of that aside, voice assistants are still all more or less rubbish.
So how should this work?
Do less. Do it really well. Reduce cognitive friction.
Make a lightbulb that you can say “on” and “off” to:
I was struck to learn that the iPhone’s “Hey Siri” feature (that readies it to accept a voice instruction, even when the screen is turned off) is a personalised neural network that runs on the motion coprocessor. The motion coprocessor is the tiny, always-on chip that is mainly responsible for movement detection, i.e. step counting.
If that chip hears you say “Hey Siri”, without hitting the cloud, it then wakes up the main processor and sends the rest of what you say up to the cloud. This is from 2017 by the way, ancient history.
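Here’s the same two-stage idea as a toy Python sketch – everything in it is invented for illustration, not how Apple actually does it: a deliberately tiny always-on scorer looks at short audio frames, and only when it fires does the expensive recogniser (or main processor, or cloud) get involved.

```python
import numpy as np

def tiny_wake_score(frame: np.ndarray, weights: np.ndarray) -> float:
    """Stage 1: a toy linear scorer over three crude features of one audio frame.
    A real device would run a small, personalised neural network here instead."""
    features = np.array([np.abs(frame).mean(), frame.std(), np.abs(frame).max()])
    return float(1.0 / (1.0 + np.exp(-features @ weights)))

def run_wake_loop(frames, weights, full_recogniser, threshold=0.9):
    """Stage 2 - the expensive part - only runs after stage 1 fires."""
    for frame in frames:
        if tiny_wake_score(frame, weights) > threshold:
            full_recogniser(frame)  # wake the big model / main processor / cloud
```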
So, commodity components time: here’s the BelaSigna R281, an ultra-low-power (300 microwatts, mic not included) chip that is “always listening” and will detect a single, user-trained trigger phrase, asserting a wake-up signal when it detects that phrase.
An embeddable wake-word detector! Let’s stick it in a lightbulb! A radio! A desk fan!
So how would a device with this simple word detector know when to pay attention? Some wild speculation…
- an on-chip, low-power image sensor – a MEMS camera maybe? With the added ability to…
- detect glances – detecting the whites of our eyes in a busy image with limited compute is basically what the whites of our eyes are there for (there’s a toy sketch after this list)
- detect pointing – harder, but (waves hands) machine learning??
(Bonus points: do all of this with energy harvesting, so no batteries, and zero power on standby.)
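To make the glance idea concrete, here’s a toy heuristic – emphatically not a real gaze detector, and every threshold is made up: on a tiny greyscale frame, bright sclera pixels stand out, and two small bright patches at roughly the same height is a crude stand-in for “a pair of eyes pointed this way”.

```python
import numpy as np

def looks_like_a_glance(frame: np.ndarray, brightness=0.85, min_pixels=4) -> bool:
    """frame: 2D array of intensities in [0, 1] from a cheap, low-resolution sensor."""
    ys, xs = np.nonzero(frame > brightness)          # candidate eye-white pixels
    if len(xs) < 2 * min_pixels:
        return False
    left = xs < np.median(xs)                        # split candidates left/right
    if left.sum() < min_pixels or (~left).sum() < min_pixels:
        return False
    # Two bright clusters at roughly the same height reads, very crudely, as a face.
    return bool(abs(ys[left].mean() - ys[~left].mean()) < frame.shape[0] * 0.05)
```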
Look, my point is that this is not beyond the reach of very clever people with computers. Stick a timer in my stove, a switch in my light bulb, give each a super limited vocabulary, never connect to the internet, and only act when somebody is addressing you.
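Putting those pieces together, here’s the shape I imagine for the bulb’s entire firmware – every hardware call (wake word, glance check, tiny vocabulary, relay) is an invented stand-in for whatever the real parts would expose:

```python
VOCAB = {"on": True, "off": False}   # the bulb's entire vocabulary

def run_bulb(hw):
    while True:
        hw.sleep_until_wake_word()        # e.g. a BelaSigna-style wake signal, no network
        if not hw.glance_detected():      # is somebody actually addressing me?
            continue                      # no - go straight back to sleep
        word = hw.hear_one_of(VOCAB, timeout_ms=1500)
        if word in VOCAB:
            hw.set_relay(VOCAB[word])     # the whole job: switch the bulb
```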
Which, in turn, gets rid of the complicated set-up and addressing interaction design issues of centralised voice assistants. No more “front room lights: lamp 1 turn on” because… you just look at it.
And also gets rid of the need to add expensive connectivity (and set-up, and security patches…) in every stove and light, and the need to convince every manufacturer to support the latest control protocol because… you just look at it.
And, ALSO also, by simplifying but spatialising the available grammar, the voice interface will be easier to learn, more reliable to use, and easier for normal humans to combine.
And yes, given this leeway, different manufacturers will go in slightly different directions. But net-net I bet that the overall simplicity is improved versus the current approach of attempting to make standardised interfaces for classes of products that have to be tweaked case by case to properly fit.
It’s a classic worse is better approach.
Why are we stuck with portals for voice?
And the reason it doesn’t work like that already, and why we’re stuck with dedicated, centralised voice assistants that need to bounce a signal off a data centre on the freaking Moon (not actually the Moon) to set a timer? Well, I can imagine a few possibilities…
- Cynically: every big tech company wants to “own” voice interactions, and be a gatekeeper to all smart devices for STRATEGIC REASONS, which is daft because trying to own an entire interaction model like that is like saying “ok let’s own buttons”.
- Dealing with voice is sufficiently complex that you need giant cloud servers to do it, and the code requires such frequent updating that device-embedded detection doesn’t make sense. I’m not sure this is the case any longer, and besides, that’s what over-the-air Bluetooth updates are for.
- Centralised voice assistants allow for more complex use cases, such as orchestrating different devices, and interacting with cloud services. Such as: if the traffic is heavy this morning, turn on the lights 20 minutes earlier.
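That last orchestration example, roughly, to show what the centralised model buys you – all of these names are invented, and a standalone voice bulb couldn’t do any of it on its own:

```python
def morning_routine(traffic_minutes, usual_wake, alarm, lights):
    """Times in minutes since midnight; traffic_minutes comes from some cloud traffic API."""
    head_start = max(0, traffic_minutes - 30)   # a heavier commute means getting up earlier
    wake_at = usual_wake - head_start
    alarm.set(wake_at)
    lights.turn_on_at(wake_at)                  # "turn on the lights 20 minutes earlier"
```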
I think that last point is probably what’s going on. I get it.
BUT
There’s that line from John Gall: “A complex system that works is invariably found to have evolved from a simple system that worked.”
So let’s get the basics right first, then layer orchestration and all the advanced stuff etc on top?
Here’s the boil-the-ocean approach
If I had all the VC money in the world, I would manufacture and sell standardised components – they would connect and act identically to mechanical buttons, switches, and dials, only they would work using embedded ML and have voice, gaze, and pointing detection, for interaction at a distance.
The goal would be to allow manufacturers of every product to upgrade their physical interfaces (add, not replace, ideally), no matter how trivial or industrial, no matter how cheap or premium. And, by doing that, discover what new possibilities are uncovered when you don’t force every voice interaction through a single model, that of requiring an internet-connected, consumer-friendly device for the home.
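As a sketch of what one of those standardised components might look like from the host product’s side – everything here is hypothetical – the product only ever sees the same open/closed contact a mechanical switch gives it, and voice, gaze, and pointing are just extra ways of pressing it:

```python
class SmartSwitch:
    """Drop-in stand-in for a mechanical switch: the host product polls .closed
    exactly as it would poll a physical contact."""

    def __init__(self):
        self.closed = False

    def press(self):                       # the original physical interface still works
        self.closed = not self.closed

    def on_wake_word(self, word):          # "on" / "off", detected entirely on-device
        if word in ("on", "off"):
            self.closed = (word == "on")

    def on_pointed_at(self):               # a glance-plus-point counts as a press
        self.press()
```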
Anyway.