I thought Tim O’Reilly’s recent article on Alexa made some really excellent points. The biggest one that stood out to me was that Alexa is always on and helpful, where other solutions such as Siri or Cortana only listen when you ask them to. Every time you start a new conversation, it is treated as totally independent of all other conversations.
With every conversation being independent, the context of a conversation is lost, or is so minimal as to be perceived as lost, within the conversational interface.
As an example, imagine this sequence with a conversational agent (Siri in this case):
- (speaker) “tell karen i’m on the way home”
- (agent) [shows the message to be sent]
- (speaker) clicks “send”
And then a few seconds later
- (speaker) “where is she?”
- (agent) “Interesting question, Joseph”
This last response is Siri’s code phrase for “WTF are you talking about?”. To most speakers, it is very clear that I’m asking about Karen. My expectation for a conversational agent is that this context would be retained; I would expect Siri to understand the relevance of Karen in that sentence.
Siri does a reasonable job of disambiguating some of these unknowns into relevant specifics. When it is not sure what you mean, it asks. For example, the question “where is Karen” on my phone brings up “Which Karen do you mean?” and shows me a list of potential choices from my address book. Once I clearly identify which one, I would hope that it would retain that context. In conversation, we often maintain a set of expectations that help inform and apply relevant context. And this is where most conversational agents currently break down – they don’t maintain any conversational context, even for a short duration. No “knowledge” is built up or maintained; we’re just talking to the same clean slate every time.
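To make that concrete, here’s a minimal sketch of what short-lived context retention could look like: the agent remembers the last contact mentioned and uses it to resolve a pronoun in the next utterance. All the names here (ConversationContext, resolve_reference, the two-minute window) are my own invention for illustration – this is not how Siri or Alexa actually work under the hood.

```python
# Minimal sketch of short-lived conversational context (hypothetical, not a real API).
import time

PRONOUNS = {"she", "he", "her", "him", "they", "them"}

class ConversationContext:
    """Remembers the most recently referenced contact for a short window."""

    def __init__(self, ttl_seconds=120):
        self.ttl = ttl_seconds
        self.last_contact = None
        self.last_updated = 0.0

    def remember_contact(self, name):
        self.last_contact = name
        self.last_updated = time.time()

    def resolve_reference(self, word):
        # Only resolve pronouns, and only while the context is still "fresh".
        expired = (time.time() - self.last_updated) > self.ttl
        if word.lower() in PRONOUNS and self.last_contact and not expired:
            return self.last_contact
        return None

ctx = ConversationContext()
ctx.remember_contact("Karen")          # after "tell karen i'm on the way home"
print(ctx.resolve_reference("she"))    # -> "Karen", so "where is she?" makes sense
```

Even something this simple would let the follow-up question land, instead of being treated as a brand-new conversation with a clean slate.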
When a conversational agent asks for clarification, it’s also making us expect something that many conversational agents do not have, or have only in a very limited sense: agency. We expect that the conversational agent will act independently and take its own actions in the world. Siri does not, however, function like that. Instead it tries to interact as though it’s acting for you – presenting augmented actions of your own rather than acting as an independent entity.
Here’s an example that illustrates that point:
- (speaker) “tell karen I’m thinking of her”
- (agent) [shows the message “I’m thinking of her”]
What I would expect is a grammar translation – a message to be sent that reads “I’m thinking of you”.
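A rough sketch of that kind of grammar translation, assuming a naive word-level substitution (a real agent would need actual parsing to handle gender, case, and ambiguity – this just illustrates the idea):

```python
# Naive third-person-to-second-person pronoun translation (illustrative only).
THIRD_TO_SECOND = {
    "her": "you",
    "him": "you",
    "she": "you",
    "he": "you",
    "hers": "yours",
    "his": "yours",
}

def translate_for_recipient(message):
    # Swap third-person pronouns for second-person ones, word by word.
    return " ".join(THIRD_TO_SECOND.get(w.lower(), w) for w in message.split())

print(translate_for_recipient("I'm thinking of her"))
# -> "I'm thinking of you"
```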
One of the user experience benefits that Tim O’Reilly’s article asserted as a positive is that Alexa doesn’t have a screen to show options or choices on. This forced their software to use conversation exclusively to share information, which is a different choice (I’m not sure it’s arguably better or worse) than the one Siri and Cortana made. I personally prefer the multi-modal aspect of the interaction, simply because it’s often easier for me to visually scan a list of choices than it is to listen to options, where I end up feeling like I’m in phone response hell. “Please listen carefully to the following options, as they may have changed…” is 7 seconds of my life lost every freakin’ time.
As conversational interfaces grow, I think this is one area where Cortana and/or Siri may have a distinct advantage – the capability of reacting and interacting not just in a conversational manner, but visually as well. One of Microsoft’s earliest experiments in this space, the well-intentioned but annoying-as-hell and now greatly maligned “Clippy”, was all about attempting to understand context and intent by watching actions and trying to be predictive. In my opinion, “The Big Mistake” was the initial concept that such a system should ever interrupt the actions being taken without invitation.
That said, the capability of using multi-modal inputs is incredibly powerful, and I don’t think we’ve even seen the first truly effective use of it. The desktop era of the late 80s added mouse movement to the keyboard, and the past decade added very effective touch interactions. Siri and Cortana are now adding voice as an input, expanding the ways we can interact with our devices. Games have been starting to do this over the past few years, as consoles have added titles that take voice commands in addition to their controller inputs. Augmented reality systems could potentially add in the camera view they have, and with object recognition technology that’s starting to be effective, add yet another layer of information for an agent to work with.
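To sketch what that could look like structurally (purely illustrative – none of these class or method names correspond to any shipping platform API), you can imagine the agent consuming a single stream of events tagged by modality, keeping a short history so one channel can add context to another:

```python
# Illustrative sketch of an agent consuming events from several modalities.
from dataclasses import dataclass
from typing import Any

@dataclass
class InputEvent:
    modality: str   # "voice", "touch", "camera", ...
    payload: Any

class MultiModalAgent:
    def __init__(self):
        self.recent_events = []

    def handle(self, event):
        # Keep a short history so one modality can inform another,
        # e.g. a camera-recognized object disambiguating a spoken "that".
        self.recent_events.append(event)
        self.recent_events = self.recent_events[-10:]
        if event.modality == "voice":
            return f"heard: {event.payload}"
        if event.modality == "camera":
            return f"saw: {event.payload}"
        return f"{event.modality} input: {event.payload}"

agent = MultiModalAgent()
print(agent.handle(InputEvent("camera", "a coffee mug")))
print(agent.handle(InputEvent("voice", "order another one of those")))
```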
The human equivalent is easy to see – you can get so much more from a voice conversation than a written one because you have the additional information of tone and timing to help intuit context. A video conference provides even more information, using body language, including facial expressions, to share additional channels of information. These additional “modes” provide a broader range of expression that is only starting to be explored and utilized in areas like affective computing.