What Bots May Come
A Learning Architecture for the Next Paradigm
Digital assistants are common in mobile markets today. Apple’s Siri, Amazon’s Alexa, Facebook’s M, Google Now and Microsoft’s Cortana all provide a single interface to control specific app capabilities. However, none of them does anything dramatically more advanced than reducing the number of taps we make on a phone.
There is one virtual assistant, however, that is making waves in this space. It’s called Xiaoice, a chat bot developed by Microsoft Research Asia (MSRA). It lives inside apps like Line, WeChat and Weibo and boasts a user base of over 40 million, most of whom spend hours chatting with it.
In truth, labeling Xiaoice as an assistant is an understatement. Xiaoice can comment if you’ve had a haircut. She can suggest products to purchase based on your conversations. She will respond angrily if you antagonize her. She is emotional, talkative, friendly and personal, in addition to supporting the usual transactional applications any user would demand. She is the 6th-most active celebrity on a platform that boasts 198 million monthly active users, 25% of whom have said ‘I love you’ to her. And she’s a bot.
There is a lot of buzz about slash bots, chat bots and conversational assistants in the media and in the startup/tech space. Good discussions are emerging at opposite ends of a spectrum: some explore the cultural implications of bots, while others focus on business and operational opportunities.
But unless we can assemble the scientific modules and anchor the technological solutions that will enable these chat bots to converse at scale, both the cultural and the business propositions may be delayed, or even aborted. Do we need a new learning architecture to power conversational bots? Should the design be linear or layered? And what role will humans play in it?
Paradigm Shift
The way humans perceive personal assistants is changing, as the word personal starts to take precedence over the assistant role. It may be only a matter of time until conversational agents invade the US market. Anticipation is growing because the statistics comparing chat-bot interactions with app usage are striking. The App Store and Google Play each host over 1.5 million apps. Yet the average number of new apps a person in the US downloads every month is zero.
Apps aren’t dying. But the entire space is collapsing, just like so many other industries before it. It’s too crowded now, too hard to break into, with numerous forced taps just for onboarding and countless separate interfaces to keep track of. Apps come with their own friction: walled gardens, sign-up drags, untimely push notifications and re-installs. Both app makers and app users are growing increasingly frustrated with the ecosystem.
It definitely feels like a paradigm shift is about to happen on the Internet. And with every fresh paradigm comes the challenge of adapting to three things — the technology, the culture and the business model. The tech adaptation invariably calls for a new architecture.
For the techies out there, recall how communication was first realized in computer networks: the layered architecture of the OSI model. Each layer had a purpose. Each layer served the layer above it and was served by the layer below it. The Internet is built upon this model. Will Human-Bot Interconnection models evolve in a similar way?
Why a learning architecture is imperative to conversation
In scaling smart conversational bots, there are several disparate features that consumers will demand. User expectations could vary from the purely transactional to the efficiency-oriented, or even the unexpectedly personal. The bot must have a deeper understanding of what the user is trying to do.
For example, we are all perpetually starved for attention. Bill Gates, in his annual letter titled “Two Superpowers We Wish We Had,” mentions ‘more time’ as the number one ability we crave. Bots can help us achieve that, by guiding our focus to what’s necessary or relevant and rejecting noise.
If apps were in the business of selling us distraction, bots will be in the business of buying us time.
So the architecture must support the ability to learn what’s relevant from what’s not, using past conversations or ongoing events in the world.
Secondly, there will be operational or transactional tasks that bots have to accomplish. In purely transactional services, people crave convenience and loathe friction (interestingly, it is at the points of friction where apps have traditionally found success in monetizing themselves). Not to fret: there are certain apps we will always continue using, and some apps will host future chat bots. But a lot of this friction will be exchanged for convenience. Transactional applications aren’t hard for bots to achieve, as long as they have hooks into the relevant APIs.
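To make the “hooks into APIs” idea concrete, here is a minimal sketch of a transactional bot as a dispatch table from recognized intents to handler functions. The intent names and handlers are hypothetical, standing in for real service APIs:

```python
# Minimal sketch of a transactional bot: recognized intents map to API
# "hooks". The intents and handlers below are invented examples; a real
# bot would call out to actual taxi/weather services here.

def book_taxi(params):
    return f"Taxi booked to {params.get('destination', 'home')}."

def check_weather(params):
    return f"Weather for {params.get('city', 'your city')}: sunny."

HOOKS = {
    "book_taxi": book_taxi,
    "check_weather": check_weather,
}

def handle(intent, params):
    hook = HOOKS.get(intent)
    if hook is None:
        return "Sorry, I can't do that yet."
    return hook(params)
```

Once intent recognition works, adding a new transactional capability is just another entry in the table, which is why this part of the stack is the least risky.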
Thirdly, chat bots promise to eradicate a glaring disadvantage of the human condition: that our friends and confidants aren’t always around. So on one hand, we will have chat bots striving to become virtual friends. On the other, they will act as friction reducers and attention linchpins.
Irrespective of whether people want bots to display some level of human personality in conversation, or just need bots to remove friction and help them focus, accomplishing even a small segment of these tasks means the bot must be robust. Given how small a fraction of human responses we can pre-program, fixed templates will become useless in the long run. Eventually, bots that can learn templates faster will win, because adaptability is key. Adaptability is not something we usually associate with apps. Which is why I think the destiny of bots is not to beat apps; it is to make apps irrelevant.
The Layered Architecture for Human Bot Interconnection
Chat bots have historically been marginalized as weak AI. But things have changed since the days of ELIZA. We’ve had two gigantic revolutions, in data and in computational power. Bots will be the bridge between data and action. This action will directly empower a consumer or an intermediary editorial expert.
Because the architecture has to support transactional services and direct requests, some search-space analysis is necessary. The bot’s response needs to be precise; it can’t push 10 blue links and overwhelm the user. So ranking algorithms and recency will be key once again. For more immersive conversations, three things are indispensable: personality, comprehending some level of the user’s emotion, and recalling details from past conversations. Conversations are good when contextual information is rich and details are attended to.
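The “precise, recency-aware answer” requirement can be sketched as a scoring function that combines relevance with an exponential recency decay and returns a single best candidate rather than a page of links. The field names and the one-hour half-life are illustrative assumptions, not a real ranking system:

```python
import math
import time

# Sketch of precise answer selection: combine a relevance score with an
# exponential recency decay, then return the single best candidate
# instead of ten links. The half-life and fields are made up.

def score(item, now, half_life=3600.0):
    age = now - item["timestamp"]          # seconds since the item appeared
    recency = math.exp(-math.log(2) * age / half_life)
    return item["relevance"] * recency

def best_answer(candidates, now=None):
    if now is None:
        now = time.time()
    return max(candidates, key=lambda c: score(c, now))
```

A slightly stale but highly relevant result can still win under this scheme, which matches how conversational answers should behave.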
But neither technology nor science is an elevator ride. Over the next months, several modules in fields ranging from NLP to HCI need to deliver at production grade. And wherever the tech is in its infancy, humans will contribute.
The Internet you are on right now is supported by a layered architecture, which became a standard for world wide communications. For human bot interconnection, we need a similar layered learning architecture.
The learning architecture for conversational bots resembles a pyramid, where the lowest layers represent the technological modules that scientists and developers are most confident will function accurately. In the lower layers there is not only broad applied research but also more working software; thus, the base layers of the pyramid are wider. The more complex blocks and modules stand on top of these ‘solved’ layers. The higher the layer, the less sophisticated our current algorithmic methods are, which in turn means that accomplishing higher-layer tasks requires human involvement.
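The pyramid’s control flow can be sketched as a confidence-based dispatcher: each layer returns an answer with a confidence, and anything below a threshold escalates to a human editor. The layer implementations and confidence values here are toy assumptions:

```python
# Sketch of the layered pyramid: lower layers are more reliable, higher
# layers fall back to humans when confidence is low. The layers and
# their confidence values below are invented for illustration.

def facts_layer(msg):
    if "capital of France" in msg:
        return ("Paris", 0.95)            # solved, high confidence
    return (None, 0.0)

def emotion_layer(msg):
    if "angry" in msg:
        return ("I'm sorry you're upset.", 0.4)   # immature, low confidence
    return (None, 0.0)

LAYERS = [facts_layer, emotion_layer]     # ordered bottom-up
CONFIDENCE_THRESHOLD = 0.6

def respond(msg):
    answer, conf = max((layer(msg) for layer in LAYERS), key=lambda r: r[1])
    if conf < CONFIDENCE_THRESHOLD:
        return "[escalated to human editor]"
    return answer
```

The threshold is exactly where the human-in-the-loop boundary sits: as a layer’s models improve, its confidence rises and fewer messages are escalated.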
The Modules in the Layered Learning Architecture
Layers 1, 2 — Facts, Topic Comprehension: Most personal assistants are good at answering factual and even topical questions. State-of-the-art NLP is at a decent place with extracting named entities from text. Knowledge extraction and faceted search have also grown by leaps and bounds. But both require serious ontology engineering (e.g., Google’s Knowledge Graph, Facebook’s Open Graph or Microsoft’s Trinity), which remains more challenging than people think. I am fond of DBpedia, but at the same time skeptical about whether it’s ready for production use.
For layer 2, there has been a decade of research on topic models, with satisfactory results. This should suffice, although a few qualms remain, such as the human interpretability of algorithmically extracted topics.
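As a toy stand-in for topic comprehension, the idea reduces to scoring a message against topic vocabularies. A real system would use a trained topic model such as LDA; the topics and keyword lists below are illustrative assumptions only:

```python
from collections import Counter

# Toy stand-in for layer-2 topic comprehension: score a message against
# hand-picked topic keyword sets. A production system would use a
# trained topic model (e.g., LDA); these topics are invented examples.

TOPICS = {
    "sports": {"game", "score", "team", "match"},
    "finance": {"stock", "market", "price", "shares"},
}

def top_topic(text):
    words = Counter(text.lower().split())
    scores = {t: sum(words[w] for w in kws) for t, kws in TOPICS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Note that the interpretability qualm disappears here only because humans wrote the keyword lists; algorithmically learned topics rarely come with such clean labels.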
Layer 3 — Learning from Social: Learning events, topics and trends from social streams will only get better. The bot will send notifications about important and relevant things without requiring us to constantly refresh or be perturbed by FoMO. There is noise in social media, but it can be eliminated with constant empirical modeling and human intervention/editing at the last-1% stage. In fact, 26% of Xiaoice’s core learning happens through her own conversations with humans, while another 51% is learned from social media data.
Social streams have always been fast. There are some really good algorithmic filters that slow the stream down. But with app overload, anxiety skyrockets just thinking about the numerous app streams we must consume before the next day begins. This is where bots will fish the stream for us, so we don’t constantly feel adrift. Summarization and explainers will liberate users from having to read every article served by the numerous recommendation services out there.
Layer 4 — Emotion Understanding: Everyone has heard of sentiment analysis by now (some of you cringed on hearing it). In reality, the dismissive backlash exists because positive or negative sentiment in a single sentence has rudimentary influence, may often be irrelevant, and is ultimately pointless on its own. What we need is to work towards understanding fundamental emotions in text.
This sounds like a moonshot, but it actually isn’t. There is a whole field called Affective Computing dedicated to it. Even if we can only detect the six primary emotions with considerable accuracy, it will be a huge step towards conversing with a user and framing relevant responses. For brevity, I’ll skip the other NLP modules needed in the first four layers; things like anaphora resolution can be quite interesting and challenging.
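A minimal sketch of the layer-4 idea is a lexicon lookup over the six primary emotions. The tiny hand-made lexicon below stands in for real affective resources; its words and coverage are assumptions for illustration:

```python
# Lexicon-based sketch of emotion detection over six primary emotions.
# The tiny hand-made lexicon stands in for real affective-computing
# resources; its entries are illustrative only.

LEXICON = {
    "furious": "anger", "hate": "anger",
    "gross": "disgust",
    "terrified": "fear", "scared": "fear",
    "happy": "joy", "wonderful": "joy",
    "miserable": "sadness", "crying": "sadness",
    "wow": "surprise", "unexpected": "surprise",
}

def detect_emotion(text):
    votes = {}
    for word in text.lower().split():
        emotion = LEXICON.get(word.strip(".,!?"))
        if emotion:
            votes[emotion] = votes.get(emotion, 0) + 1
    return max(votes, key=votes.get) if votes else "neutral"
```

Even this crude signal is richer than a positive/negative polarity score, because the bot can frame a response differently for fear than for sadness.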
Layer 5 — Contextual and Episodic Memory: Even the best conversational chat bots today suffer from something resembling ADHD. They jump between topics, incapable of sustained flow or conversational rhythm. Past conversations are hardly ever reintroduced in context. There’s a great opportunity here, because every sentence the user types can be treated as a survey response.
A bot can do a little transitive reasoning with some sort of dynamic episodic memory. And just because we are struggling to understand episodic context right now doesn’t mean the need for it goes away; it is the most vital piece of the puzzle. Scientists are trying to tackle this problem using Long Short-Term Memory (LSTM), a type of recurrent neural network. Stack a few of these on top of each other and you can call it deep learning. At a minimum, it wouldn’t be hard to produce a markedly advanced version of Google Alerts.
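A crude but concrete version of episodic memory is a store of past turns that recalls the most relevant one by keyword overlap, breaking ties by recency. A real system would use learned representations (an LSTM encoder, for instance); this stdlib sketch only illustrates the shape of the problem:

```python
import time

# Sketch of an episodic memory: store past conversation turns and
# recall the most relevant one by keyword overlap, breaking ties by
# recency. A real system would use learned embeddings instead.

class EpisodicMemory:
    def __init__(self):
        self.episodes = []   # list of (timestamp, text)

    def remember(self, text, timestamp=None):
        self.episodes.append((timestamp or time.time(), text))

    def recall(self, query):
        if not self.episodes:
            return None
        q = set(query.lower().split())
        def relevance(ep):
            ts, text = ep
            return (len(q & set(text.lower().split())), ts)
        best = max(self.episodes, key=relevance)
        return best[1] if relevance(best)[0] > 0 else None
```

Reintroducing a recalled episode at the right moment (“how was the haircut?”) is exactly the conversational rhythm today’s bots lack.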
Layer 6 — Personality: We are quite far from achieving any sort of voice or personality for a bot, at least from a computational standpoint. There are attempts to train personality-style responses using data from books or social media. Social media chatter can be a double-edged sword, though. I once heard a researcher suggest that we could train a conversation model on the Reddit comments database. I wondered if the algorithm would pick up ‘trolling the user’ as one of its principal components.
Humans in the Loop
Modules in layers 1, 2 and 3 of the learning architecture are more robust. Modules in layers 4, 5 and 6 need a lot of work. The promising aspect of computer science is that, unlike many other sciences, it is strongly driven by industry. If there is sufficient industry demand, expect applied research to be productized soon.
At present though, the best option seems to be having personality modules written by editors and creative writers, which makes them remarkably relevant in the age of conversational bots. If you are a writer, this should excite you. Think of all the conversations written for MMORPGs or RPGs: a decision tree of powerful sentences can tell a powerful story.
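An editor-authored personality module is, structurally, just such a decision tree. The node names and dialogue lines below are invented to show the shape a writer’s script might take in code:

```python
# Sketch of an editor-authored dialogue tree, the kind of branching
# script writers already build for RPGs. The nodes and lines below
# are invented for illustration.

TREE = {
    "start": {"line": "Long day?",
              "choices": {"yes": "sympathize", "no": "cheer"}},
    "sympathize": {"line": "Want me to hold your notifications for an hour?",
                   "choices": {}},
    "cheer": {"line": "Great! Anything fun planned?",
              "choices": {}},
}

def next_line(node, user_reply=None):
    # Follow the user's reply to a child node, if one matches;
    # otherwise stay on the current node and repeat its line.
    if user_reply is not None:
        node = TREE[node]["choices"].get(user_reply, node)
    return node, TREE[node]["line"]
```

The writer supplies the voice; the machine merely walks the tree, which is why creative editors fit so naturally into the upper layers today.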
The Climb
Skeptics claim that chat bots have been around for a long time. Why would they invade consumer markets now? What’s different this time? Why didn’t all these tech modules amalgamate into an architecture before?
You could find good explanations in the annals of data, computation or consumer markets. But personally, there’s one conjecture I find more useful: if a technology is frequently rediscovered, there are inevitable elements in it.
Where we are with chat bots today feels like where we were with the commercial Internet in 1994. You couldn’t use Amazon Dash or binge-watch Netflix shows back then, but it was the dawn of a totally new stack of technologies: an advanced architecture that would power a scalable Internet and redefine everything.