IBM Watson Assistant

Product Design, Enterprise Software

Lead Product Designer

Disclaimer

The following case study is personal and does not necessarily represent IBM’s positions, strategies, or opinions. I have omitted and obfuscated confidential information.


Client

IBM

Timeframe

April 2019 – Present

Role

Senior Product Designer

Type

Enterprise Software, Artificial Intelligence, Conversational Design


Overview

Watson Assistant is an AI conversation platform that helps users find quick, accurate, and straightforward answers to their questions across any application, device, or channel. I joined the Watson Assistant team as a Senior Product Designer in early 2019, and my first mission was to help the team develop a feature that leveraged a new algorithm to learn from user behavior and improve conversation quality.


My role

I was a Senior Product Designer on the Watson Assistant team. On this project, I worked closely with product management, development, and research partners, as well as with junior designers and design researchers.


Challenge

In 2019, a team of IBM researchers had begun perfecting an algorithm that could learn from end-user conversations and improve an assistant through a series of automatic modifications. Improving conversations had always been a manual, time-consuming process, but with this new algorithm, called autolearning, the assistant could begin to remove that burden from the user.

The algorithm was ready to integrate into the product experience, but there were challenges the product team had to overcome to integrate this new functionality successfully. These challenges were unique and rooted in user experience, centering on difficult themes such as building trust and understanding how users defined success.



Approach

Research

First and foremost, we had to understand how Assistant users defined and assessed the success of their assistants, so that we could identify the right measures for communicating autolearning's impact on an assistant's performance. We started with a series of interviews with product users to understand how they defined success.

We found that users generally had difficulty articulating their definition of success. Some indicated that a successful assistant was one trained on enough topics to dialog seamlessly with its users, a concept called coverage. Others wanted to know that the assistant could hold a full conversation without escalating to a live agent, a concept called containment. But all users agreed that containment and coverage did not paint a complete picture of conversational success: there was no way for them to tell whether the conversations their assistants facilitated were quality conversations that created a satisfying experience for the end-user.
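To make these two concepts concrete, here is a minimal sketch of how coverage and containment might be computed from conversation logs. The log schema (Message, Conversation, intent_recognized, escalated_to_agent) is hypothetical and far simpler than Watson Assistant's real analytics data; it only illustrates the ratios behind each metric.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical, simplified log records used only to illustrate the two metrics;
# the real Watson Assistant analytics schema is different.
@dataclass
class Message:
    intent_recognized: bool  # assistant confidently matched a trained topic

@dataclass
class Conversation:
    messages: List[Message] = field(default_factory=list)
    escalated_to_agent: bool = False

def coverage(conversations: List[Conversation]) -> float:
    """Share of end-user messages the assistant was trained to handle."""
    msgs = [m for c in conversations for m in c.messages]
    return sum(m.intent_recognized for m in msgs) / len(msgs) if msgs else 0.0

def containment(conversations: List[Conversation]) -> float:
    """Share of conversations that never escalated to a live agent."""
    if not conversations:
        return 0.0
    return sum(not c.escalated_to_agent for c in conversations) / len(conversations)
```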

Through a comparative and competitive analysis, we identified three potential candidates for our end-user experience metric: acceptance rate, customer effort score, and response quality score.

Sketches

Our research revealed clues about what mattered to our customers when assessing success and which metrics to pursue, but there was another key outcome we needed to achieve with these designs.

We had to figure out how to earn users' trust in autolearning. AI can be a black box for users, and our goal was to make sure users understood what autolearning does: it automatically improves an assistant's performance by consuming and learning from real end-user data from an assistant running in production.
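To give a sense of what "learning from real end-user data" can mean in practice, here is a deliberately simplified sketch; it is not IBM's actual algorithm. It assumes a hypothetical production log of which candidate response each end user selected, and uses those counts to re-rank future candidates. The real autolearning algorithm is, of course, far more sophisticated than this.

```python
from collections import Counter
from typing import Dict, List, Tuple

def learn_preferences(selection_log: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Tally which candidate response end users picked for each utterance.

    `selection_log` is a hypothetical list of (utterance, chosen_response_id)
    pairs gathered from an assistant running in production.
    """
    prefs: Dict[str, Counter] = {}
    for utterance, chosen in selection_log:
        prefs.setdefault(utterance, Counter())[chosen] += 1
    return prefs

def rerank(utterance: str, candidates: List[str],
           prefs: Dict[str, Counter]) -> List[str]:
    """Promote candidates that end users have historically preferred."""
    counts = prefs.get(utterance, Counter())
    return sorted(candidates, key=lambda c: counts[c], reverse=True)
```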

Thanks to our technical research, we knew that the algorithm was conservative and that the risk of negative effects on performance was incredibly low. But what did our users need to see to trust that the system would make the right choices? How could we make sure that users felt good about turning on autolearning for their assistant?

I started iterating with paper and pencil on several early designs, focusing on concepts that would show enough transparency and confidence to earn users' trust, and on the right combination of metrics to prove autolearning's value and impact. Starting on paper let me explore more than 20 concepts in a few days. After I reviewed my sketches with my interdisciplinary team and fellow designers, we narrowed the 20 concepts down to three candidates for testing with our users.


Concept testing

With our three concepts identified, we were ready to begin testing with users. I collaborated with a UX researcher on the team to develop a test plan. My research partner began fleshing out the test and recruiting users, while I created a video that would be used to introduce each user to the concept of autolearning. This video would allow each participant to receive the same explanation of the general premise behind autolearning and would help ensure consistency and limit bias in each test.

Once we had tested our concepts with several customers, the results were in. Most of our testers favored customer effort score as their preferred user experience metric. Users expressed that making their end-user experience as easy as possible was their highest priority, and stated that the customer effort score allowed them to assess that experience in a meaningful way.

Testers also expressed enthusiasm for the general concept of autolearning. It would take care of improving some aspects of their assistant for them, freeing their attention for building new dialog or editing problematic dialog nodes. The overall benefit of autolearning, coupled with the explanation of how it worked in the concepts, gave users enough confidence to be willing to test the feature. The concepts that let users compare performance metrics with and without autolearning demonstrated the feature's value and impact, and the ease of turning autolearning on and off shown in the concepts eased any remaining anxiety about trying it. If users did not like what autolearning was doing, all they had to do was flip a switch and their assistant would return to its original state.

Iterations

Several rounds of wireframe iterations followed our concept tests, with user testing and validation in every cycle. Through these rounds of iterative testing, we learned that the customer effort score was not yet mature enough to be developed and launched with autolearning. However, the sub-metrics we evaluated alongside the customer effort score tested well, with users stating that the sub-metrics alone bolstered their confidence in autolearning and its benefits. Working with our squad leadership, we ultimately decided to pull the customer effort score from the first release in order to spend more time iterating on the metric itself.

Through this testing and collaboration with our interdisciplinary squad, we were able to build an MVP for release within a few sprints.


Results

Autolearning was released in Watson Assistant in two phases: a closed beta in May 2020 and an open beta in December 2020.

Since its beta release, autolearning has been adopted by a number of Watson Assistant users and is actively improving assistants, driving positive impact on coverage, containment, and the customer effort sub-metrics.

Next steps

As autolearning continues to mature, the team is actively working to enable the feature to learn from even more elements of the conversation and to perfect the customer effort score metric.


Learn more about Watson Assistant.
