One of the basic mechanics of tech over the past few decades has been that we tend to start with special-purpose components, but then over time we replace them with general-purpose components. When performance is expensive, optimizing for one task gets you a better return, but then economies of scale, Moore’s Law and so on mean that the general-purpose component overtakes the single-purpose one. There’s an old engineering joke that a Russian screwdriver is a hammer – “just hit it harder!” – and that’s what Moore’s Law tends to do in tech. “Never mind your clever efficiency optimizations, just throw more CPU at it!”
You could argue this is happening now with machine learning – “instead of complex hand-crafted rules-based systems, just throw data at it!” But it’s certainly happening with vision. The combination of a flood of cheap image sensors coming out of the smartphone supply chain with computer vision based on machine learning means that all sorts of specialized inputs are being replaced by imaging plus ML.
The obvious place to see this is in actual physical sensors – there are all sorts of industrial applications where people are exploring using vision instead of some more specialised sensing system. Where is that equipment? Has that task been done? Is there a flood on the floor of the warehouse? Many of these things don’t necessarily *look* like vision problems, but now they can be turned into one.
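To make that concrete, here is a minimal sketch (not anything from a real deployment) of how the warehouse-flood question becomes a vision problem: a camera frame goes through a pretrained image model with a two-class head. The file name, the class labels and the fine-tuning you would first need on labelled frames of the actual floor are all assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Pretrained backbone with its head swapped for a two-class problem.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # classes: ["dry", "flooded"]
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def is_floor_flooded(image_path: str) -> bool:
    """Classify a single camera frame. The new head above would of course
    need fine-tuning on labelled frames before the answer means anything."""
    frame = Image.open(image_path).convert("RGB")
    batch = preprocess(frame).unsqueeze(0)  # shape: [1, 3, 224, 224]
    with torch.no_grad():
        logits = model(batch)
    return logits.argmax(dim=1).item() == 1  # index 1 == "flooded"

print(is_floor_flooded("warehouse_cam_frame.jpg"))  # hypothetical frame
```

The point of the sketch is the shape of the solution, not the model: a question about the world becomes “classify this frame”, and the sensor is just a camera.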
This is also, of course, the debate between Elon Musk and the autonomy community around LIDAR. A priori, you should be able to drive with vision alone, without all these expensive, impractical special-purpose sensors – after all, people do. Just throw enough images and enough data at the problem (“hit it harder!”) and we should be able to make it work. Pretty much everyone does actually agree in theory – the debate is about how long it will take for vision to get there (consensus = ‘not yet’), and how hard it is to solve all the other components of autonomy. Giving the car a perfect model of the world around it, with or without LIDAR, is not the only problem to solve.
This also overlaps with two other current preoccupations of mine: how can we get software to be good at recommendation and discovery, and how does machine learning move internet platforms away from being mechanical Turks?
So far the internet has been very good at giving us what we already know we want, either in logistics (Amazon) or search (Google). It’s been much worse at knowing what we might like, without any explicit request. And to the extent that it can do this, we need people to tell the system first. You have to buy a lot of stuff on Amazon or like a lot of things on Instagram or Spotify for your recommendations to be any good. Meanwhile, these systems often have no real understanding of what the data you’re interacting with actually is – hence all the jokes about Amazon recommendations: “Dear Amazon, I bought a refrigerator but I’m not collecting them – don’t show me five more”. In all of this, the user is being treated as a mechanical Turk – the system doesn’t know or understand, but people do, so find ways to get people to tell it (this is also what PageRank did).
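As a toy illustration of why these systems need people to tell them first, here is a sketch of a bare-bones item-item collaborative filter over an invented purchase matrix. None of this is how any particular retailer actually works; the point is simply that a user with no history gets nothing useful back (the cold-start problem), because the system has no understanding beyond what people have already told it.

```python
import numpy as np

# Toy user x item purchase matrix: rows are users, columns are items.
# The system only 'knows' what people have explicitly told it by buying.
purchases = np.array([
    [1, 1, 0, 0],   # user 0 bought items 0 and 1
    [0, 1, 1, 0],   # user 1 bought items 1 and 2
    [1, 0, 1, 1],   # user 2 bought items 0, 2 and 3
    [0, 0, 0, 0],   # a brand-new user: no history at all
], dtype=float)

# Item-item cosine similarity, computed purely from co-purchases.
norms = np.linalg.norm(purchases, axis=0, keepdims=True)
norms[norms == 0] = 1.0                       # avoid dividing by zero
item_sim = (purchases.T @ purchases) / (norms.T @ norms)
np.fill_diagonal(item_sim, 0.0)               # ignore self-similarity

# Recommendation score = similarity-weighted sum of what the user already bought.
scores = purchases @ item_sim
print(scores[0])  # user 0 gets sensible scores for items 2 and 3
print(scores[3])  # the new user gets all zeros: nothing to go on
```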
Now, suppose I post five photos of myself and Mr Porter knows what clothes to recommend, without my having to buy anything first, or go through any kind of onboarding? Suppose I wave my smartphone at my living room, and 1stDibs or Chairish know what lamps I’d like, without my having to spend days browsing, liking or buying across an inventory of thousands of items? And what happens if a dating app actually knows what’s in the photos? No more swiping – just take a selfie and it tells you who the match is. Seven or eight years ago this would have been science fiction, but today it’s ‘just’ engineering, product and route to market.
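The ‘wave your phone at the living room’ version is, in sketch form, an embedding problem: encode the user’s photos and the catalogue with the same pretrained image model and rank by similarity. Everything below (the file names, and the use of off-the-shelf ImageNet features rather than a model actually trained for style or taste) is an assumption for illustration, not anyone’s product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Pretrained encoder with the classification head removed, so it outputs
# a 512-dimensional embedding per image rather than ImageNet classes.
encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder.fc = nn.Identity()
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def embed(paths: list[str]) -> torch.Tensor:
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        return F.normalize(encoder(batch), dim=1)  # unit-length embeddings

# A handful of photos from the user: no purchase history, no onboarding.
user_vecs = embed(["living_room_1.jpg", "living_room_2.jpg"])   # hypothetical files
taste = F.normalize(user_vecs.mean(dim=0, keepdim=True), dim=1)

# The catalogue, which a real service would embed once, offline.
catalogue = ["lamp_001.jpg", "lamp_002.jpg", "lamp_003.jpg"]     # hypothetical files
item_vecs = embed(catalogue)

# Rank catalogue items by cosine similarity to the user's visual 'taste' vector.
scores = (item_vecs @ taste.T).squeeze(1)
for idx in scores.argsort(descending=True).tolist():
    print(catalogue[idx], round(scores[idx].item(), 3))
```

A real system would train the encoder so that ‘things this person would like’ end up close together, but the structural point stands: the input is just images, and the user never has to tell the system anything.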
The common thread across all of this is that vision replaces single-purpose inputs, and human mechanical Turks, with a general-purpose input. It might be facetious (or might not) to think computer vision will match dates, but the crucial change is the degree to which computer vision lets computers turn imaging into structured data. An image sensor isn’t a ‘camera’ that takes ‘photos’ – it’s a way to let computers see.