16 April 2024
Introduction
Welcome to Part II of our series on Navigating AI Success Metrics, where we’ll dive into the nuances of understanding conversational data with AI. After laying the groundwork in Part I with the core concepts of Recall, Rejection, and Precision, we’re now turning our attention to the dynamic and often complex world of conversational data, such as the interactions found in Slack or Microsoft Teams. The conversations held within these platforms are goldmines of insight, requiring sophisticated approaches to measure relevance. But first, a little review of the terms.
Recall, rejection, and precision: Key concepts revisited
In Part I of this series on Navigating AI Success Metrics, we introduced the terms Recall, Rejection, and Precision, using first-pass review of emails as an example:
- Recall: The fraction of relevant documents correctly identified as “Responsive.”
- Rejection: The fraction of non-relevant documents correctly identified as “Not Responsive.”
- Precision: The fraction of documents identified as “Responsive” that are relevant.
We found that, in general, Recall and Rejection are in tension with each other, making it difficult to develop a process that achieves high values for both. Ultimately, we care most about Recall (how good we are at keeping relevant content) and Precision (what fraction of our time downstream is spent considering relevant content).
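To make these definitions concrete, here is a minimal sketch in Python (using hypothetical per-document labels, not data from any real review) showing how the three fractions can be computed from ground-truth and predicted labels:

```python
def review_metrics(actual, predicted):
    """Compute Recall, Rejection, and Precision from per-document labels.

    actual and predicted are parallel lists of booleans:
    True = "Responsive" (relevant), False = "Not Responsive".
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # relevant and kept
    fn = sum(1 for a, p in pairs if a and not p)      # relevant but missed
    fp = sum(1 for a, p in pairs if not a and p)      # not relevant but kept
    tn = sum(1 for a, p in pairs if not a and not p)  # not relevant and rejected

    recall = tp / (tp + fn) if tp + fn else 0.0       # relevant documents we kept
    rejection = tn / (tn + fp) if tn + fp else 0.0    # non-relevant documents we rejected
    precision = tp / (tp + fp) if tp + fp else 0.0    # kept documents that are relevant
    return recall, rejection, precision
```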
The complexity of measuring conversational data
We also briefly discussed the variability in performance when performing first-pass reviews on documents such as emails. With this foundation, Part II ventures into the more intricate realm of conversational data, where the flow of dialogue presents unique challenges and opportunities for measuring success.
A real-world scenario: Using conversational data to investigate a company car incident
Consider the following scenario. A company car has been crashed in the parking garage by an employee who should not have had access to the keys, and you want to perform an investigation to find out what happened. Below is a conversation taken from Slack between two employees:
09:00 Alice: Morning!
09:01 Bob: Hey Alice, how’s it going?
09:02 Alice: I’m good! Feeling refreshed after the weekend.
09:03 Bob: Great to hear it.
09:04 Alice: I have a favor to ask. A bit of a strange one.
09:05 Bob: Sure, what is it?
09:06 Alice: Before I forget, the Sales call has been moved to 2 pm.
09:07 Bob: No problem, thanks for keeping me updated!
09:08 Bob: What was the favor?
09:25 Alice: Um, I’m not sure I should ask.
09:26 Alice: Can I borrow the company Chevrolet tomorrow? I need to move some kit to the other side of the site.
09:45 Bob: I can’t see a problem. Let me check…
09:46 Alice: Great, thanks so much!
10:30 Bob: Bad news, I’m afraid. You’re not on the approved list of users for the cars.
10:31 Alice: So I can’t get the keys?
10:32 Bob: I’m not sure.
10:33 Bob: Leave it with me. I’ll see what I can do.
10:34 Alice: You’re the best! I’ll bring you back a coffee next time I head downstairs.
14:05 Alice: Bob? Sales call time!
14:06 Bob: Oops! Be there in a few minutes…
16:00 Bob: Thanks for the coffee!
Challenges in labeling conversational data
If Alice is the employee who crashed the car, and Bob is the employee who gave her the keys, then clearly this conversation is relevant and should be marked as “Responsive.” But which messages should be marked as responsive? The messages contain information about the car, but also about a Sales call and about Alice offering to get coffee for Bob. Depending on who labels this conversation, you might get all of the messages or only some of them. Is the coffee a form of bribe for bending the rules? The conversation also unfolds in bursts of activity, with the two employees replying after long gaps or sending new messages before receiving a reply to the previous one.
Beyond individual messages: Counting discussions
Simply counting individual messages in conversational data does not give a good enough measure of Recall, Rejection, and Precision. Conversational datasets are free-flowing, continuous, and have fuzzy edges. Two people will have different opinions on where a discussion starts or ends, and so would include different messages. Instead of counting messages, we should count discussions and allow for some ambiguity in where a discussion starts and ends. When we find relevant content, it’s a good idea to include the messages before and after it to provide context.
Discussions give us a new way to think about conversational data that sets it apart from document-based data, such as emails. Instead of thinking about documents with a fixed start and end, where all of the content is either relevant or not relevant, we should think about messages as points in time, surrounded by context. Instead of providing lists of messages, we should provide lists of discussions. This can even reduce the overall number of documents for consideration in downstream workflows.
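As a sketch of how this might look in practice, the snippet below (assuming messages arrive as a single time-ordered list, and using an arbitrary context window of a few messages, both choices ours rather than anything prescribed above) expands individually flagged messages into discussions and merges the ones that overlap:

```python
def build_discussions(messages, relevant_indices, context=3):
    """Turn flagged messages into discussions with surrounding context.

    messages: time-ordered list of messages from a channel.
    relevant_indices: indices of messages flagged as relevant.
    context: number of messages to include before and after each hit.

    Returns a list of (start, end) index ranges. Overlapping or adjacent
    ranges are merged, so nearby hits collapse into a single discussion.
    """
    discussions = []
    for i in sorted(relevant_indices):
        start = max(0, i - context)
        end = min(len(messages) - 1, i + context)
        if discussions and start <= discussions[-1][1] + 1:
            # Extend the previous discussion rather than starting a new one.
            discussions[-1] = (discussions[-1][0], max(discussions[-1][1], end))
        else:
            discussions.append((start, end))
    return discussions
```

In the Alice-and-Bob exchange above, flagging only the 09:26 message about borrowing the Chevrolet would, with a context of three messages, pull in several of the surrounding messages about the favor and the keys as a single discussion.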
New approaches to conversational data with large language models (LLMs)
The bad news is that conversational data makes working with Technology-Assisted Review (TAR) and Continuous Active Learning (CAL) much more difficult. To use these technologies effectively, we need to work with documents, and as we have seen, treating a single message as a document doesn’t work very well. We could consider groups of messages, but we would need to decide how to form these groups. Instead, we can turn to new technology to address the problem.
Case study results: Applying LLMs to conversational data
How does this new way of thinking relate to Recall, Rejection, and Precision? In one test dataset, we tried a very simple measure of Recall in the following way. We applied a method using Large Language Models (LLMs), a form of generative AI, to identify messages as “Relevant.” We compared the actual labels in the dataset to the labels we obtained using LLMs, and simply counting the matching messages, we achieved a Recall of 1% (with a Rejection of 99% and Precision of 96%). That’s an abysmal value for Recall! Looking more closely at the dataset, we saw the LLMs would find a single message and label it as “Relevant,” but in the dataset, 10-20 messages were labeled as “Responsive.”
By including messages on either side of the “Relevant” messages as context, we could count overlapping discussions instead of individual messages. After implementing this single refinement, we observed a noticeable improvement: Recall increased to 59%, alongside a Rejection of 69% and a Precision of 96%. This demonstrates the substantial impact of even one adjustment. Other refinements can enhance these metrics further, but we wanted to highlight the significant gains achieved by taking into account the larger context of the conversation.
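As an illustration of this kind of discussion-level scoring (a simplified sketch with hypothetical ranges, not the exact method behind the numbers above), a labeled discussion can be counted as found whenever any predicted discussion overlaps it:

```python
def overlaps(a, b):
    """True if two (start, end) message-index ranges share any messages."""
    return a[0] <= b[1] and b[0] <= a[1]

def discussion_recall(labeled, predicted):
    """Fraction of labeled discussions that overlap at least one prediction."""
    if not labeled:
        return 0.0
    found = sum(1 for l in labeled if any(overlaps(l, p) for p in predicted))
    return found / len(labeled)

# Hypothetical example: three labeled discussions, two LLM-derived predictions.
labeled = [(10, 25), (40, 55), (80, 90)]
predicted = [(18, 22), (83, 86)]
print(discussion_recall(labeled, predicted))  # 2 of 3 found -> ~0.67
```

Counting overlaps rather than exact message matches is what lets a single correctly identified message stand in for the 10-20 labeled messages around it, which is why Recall recovers so dramatically.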
Transitioning from traditional methods to LLMs for conversational data
As we close Part II, we’ve discussed the need to redefine how we view documents in conversational datasets and how sticking to old definitions can lead us to underestimate key metrics like Recall, Rejection, and Precision. This insight has guided us away from conventional methodologies like TAR and CAL towards using LLMs, which are better suited for analyzing conversational data.
Look ahead to Part III
In Part III, we’ll synthesize the learnings from both parts into a hybrid dataset approach. This will help us better understand and apply these metrics in practice, particularly in legal contexts where precision is paramount. Stay tuned for a deeper dive into how these concepts combine to enhance our analysis and application of AI in first-pass review and investigation.
Explore the foundational insights of AI success metrics in email and document analysis by diving into the first part of our series: