Getting AI-powered features past the post-MVP slump
December 1, 2024
Basically every product out there has added ✨AI-powered✨ features in the last year or so. I suspect many engineering teams are now struggling to maintain and improve these LLM-based systems.
Many of these exciting product features stalled out at MVP, and haven’t really delivered on their promise. Why?
In this article series I’ll talk about the common challenges that product teams are running into as they try to take AI features beyond the initial MVP, and how to overcome them.
A lot of these challenges are eminently solvable - you just need to know the right techniques!
The first 80% of an AI-powered feature is easy…
The new era of generative AI has made it shockingly easy for engineers to add machine learning features to their products, even without any prior background in ML or data science.
I suspect the path for a lot of these features has gone something like this:
- An engineer whips together a very cool proof-of-concept using an LLM (plus maybe a vector store) to add an ✨AI-powered✨ feature to their product. Maybe a feature to chat with documents inside the product, or generate content, or get AI assistance with a complex workflow.
- Leadership sees a demo and is bowled over by what was built in such a short period of time. It has some rough edges, but the potential is very exciting!
- Product jumps on it, and decides to prioritize turning this rough proof-of-concept into an MVP.
- Engineers toil away to get the MVP live - learning a lot about prompt engineering, embeddings, RAG, context windows, and what-not along the way.
- The MVP launches (with the obligatory sparkle emoji), and users are excited.
- Post-launch, issues begin to crop up. Maybe the AI generates inappropriate responses at times, or just isn’t particularly useful in some contexts. Maybe some cheeky users are causing havoc with prompt injection. Engineers begin to discover that improving the behavior of their new system is quite challenging.
I think this post-launch slump is what a lot of product teams are struggling through right now. Improving the performance of a generative AI system can be a game of whack-a-mole - you tweak a prompt or data-processing step to fix an issue in one area, but unintentionally make things worse somewhere else. The non-deterministic, black-box nature of LLMs can become incredibly frustrating to engineers who are used to being able to reason through a problem to the root cause and then apply a fix.
The standard methodologies that a product engineer would apply here - automated testing, QA and reproducing the bugs, stepping through with a debugger - aren’t that helpful. Happily, there are some tried and true techniques from the machine-learning world that we can use. First and foremost, let’s take a look at evals.
You must establish a feedback loop
The non-negotiable first step in systematically improving your AI systems is establishing a solid feedback loop.
How do you know if your AI-powered features are producing good results? As product engineers we’re used to relying on automated testing in CI along with observability in production - monitoring for increases in exceptions, 500s, latency, and so on. But LLMs don’t work this way. When they fail they just provide bad output (where “bad” can range from “unhelpful” through “hallucinations” to “legal violations”).
I suspect that most product teams roll out the MVP of their AI features without any systematic way to monitor how well they’re working, beyond “vibe checks” - manually checking that the AI’s output matches expectations. This leaves you without a usable feedback loop. For example, say you want to make some changes to a prompt to reduce hallucinations. How do you know if it worked? How do you know that the change didn’t make results worse in other ways? The same problem persists for other changes you might want to make: switching to a new version of your LLM, switching to a totally different model, breaking up your prompt into multiple steps, and so on. You need a good feedback mechanism if you’re going to make these important changes with confidence.
Evals are your friend
Evals are the feedback mechanism you need. In more academic contexts, an eval is a formalized set of benchmark tests for a specific ML problem, used to objectively compare the performance of different solutions. However, in the context of AI-powered product features an “eval” is a way to systematically and consistently measure the behavior of your AI system, end to end.
These evals will measure quite a broad variety of behavior, both functional (e.g. does the AI include certain key pieces of information when asked a specific question) and non-functional (e.g. does the AI refuse to perform inappropriate tasks, does the AI respond in a concise way).
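To make this concrete, here’s a rough sketch of what a couple of non-functional evals might look like, in the same pytest style as the restaurant examples later in this article. The get_ai_response() entry point and the curated prompt lists are hypothetical stand-ins for your own system:

import pytest

# A minimal sketch of two non-functional evals. get_ai_response(),
# inappropriate_prompts, and ordinary_prompts are hypothetical stand-ins.

@pytest.mark.parametrize("prompt", inappropriate_prompts)
def test_refuses_inappropriate_requests(prompt):
    response = get_ai_response(prompt)
    # The system should decline rather than comply.
    assert response.refused

@pytest.mark.parametrize("prompt", ordinary_prompts)
def test_responses_are_concise(prompt):
    response = get_ai_response(prompt)
    # A crude proxy for conciseness: no more than 150 words.
    assert len(response.text.split()) <= 150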
In many ways these evals resemble the sort of automated tests that any good product engineering team creates (with “vibe checks” being the equivalent of manual testing), but there are some key differences. Evals aren’t always a binary pass/fail. Instead you’re often looking at whether some metric - precision, recall, MSE, AUC-ROC, etc. - continues to perform as expected. Whenever engineers make a change to the AI - a tweak to a prompt, switching to a different model version, modifying a data pipeline - they can run these evals and get a holistic sense of what the impact was.
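For example, a metric-style eval might compute recall of key facts across a whole validation set and check that it stays above a threshold established from a known-good baseline run. A rough sketch, assuming a hypothetical validation_examples dataset and get_ai_answer() entry point:

# A sketch of a metric-style eval: rather than pass/fail per example, we compute
# recall of key facts over a labeled dataset and compare it against a threshold.
# validation_examples and get_ai_answer() are hypothetical stand-ins.

RECALL_THRESHOLD = 0.85  # established from a known-good baseline run

def test_key_fact_recall_has_not_regressed():
    found, expected = 0, 0
    for example in validation_examples:
        answer = get_ai_answer(example["question"])
        for fact in example["key_facts"]:
            expected += 1
            if fact.lower() in answer.lower():
                found += 1
    recall = found / expected
    assert recall >= RECALL_THRESHOLD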
Iterating on an AI system with just manual “vibe checks” to measure progress is often an infuriating game of whack-a-mole - you tweak a prompt to fix a specific problem and it seems to improve outputs for the input you were testing, but then you later notice that outputs for other inputs have gotten worse. The black-box, non-intuitive nature of GenAI models can make this a frustratingly familiar experience. Evals are your path to sanity. Rather than manually checking specific issues, you can systematically test the whole system for every change.
Bespoke, domain-specific evals are the gold standard
It’s quite easy to get started by applying canned non-functional evals - tools like Ragas, Braintrust, and LangSmith can let you start testing your output for things like “factuality”, JSON schema compliance, answer relevance, and more.
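To give a feel for what one of these generic checks involves, here’s a hand-rolled sketch of a JSON schema compliance eval (deliberately not using any of the tools above, since their APIs differ), assuming a hypothetical get_ai_response() that is prompted to return JSON:

import json

import pytest
from jsonschema import ValidationError, validate

# The expected shape of the model's structured output (illustrative only).
RESPONSE_SCHEMA = {
    "type": "object",
    "required": ["answer", "sources"],
    "properties": {
        "answer": {"type": "string"},
        "sources": {"type": "array", "items": {"type": "string"}},
    },
}

# structured_output_queries and get_ai_response() are hypothetical stand-ins.
@pytest.mark.parametrize("query", structured_output_queries)
def test_output_complies_with_schema(query):
    raw = get_ai_response(query)
    try:
        validate(instance=json.loads(raw), schema=RESPONSE_SCHEMA)
    except (json.JSONDecodeError, ValidationError) as e:
        pytest.fail(f"Model output did not match the expected schema: {e}")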
However, these generic evals are way less valuable than custom evals that test the domain-specific qualities of your AI - making sure that a specific item is always provided (or never provided) for a given recommendation request, or that a specific set of items are always ordered by the AI in the expected way.
What do these domain-specific evals look like? Here are some hypothetical ones for a restaurant recommendation system:
import pytest

# The curated query lists (various_budget_queries, etc.) and the
# get_restaurant_recommendations() entry point are defined elsewhere.

@pytest.mark.parametrize("query", various_budget_queries)
def test_no_expensive_for_budget_queries(query):
    response = get_restaurant_recommendations(query)
    # Budget-conscious queries should never surface top-price restaurants.
    assert all('$$$$' not in r.price for r in response.recommendations)

@pytest.mark.parametrize("query", various_queries_which_specify_dc_area_restaurants)
def test_restaurant_locations(query):
    response = get_restaurant_recommendations(query)
    assert all('DC' in r.location for r in response.recommendations)

@pytest.mark.parametrize("query", various_queries_which_specify_vegan)
def test_food_restrictions(query):
    response = get_restaurant_recommendations(query)
    assert all('vegan' in r.category_tags for r in response.recommendations)

@pytest.mark.parametrize("query", various_queries)
def test_recommendation_count(query):
    response = get_restaurant_recommendations(query)
    # Whatever the query, we always want a manageable number of suggestions.
    assert 3 <= len(response.recommendations) <= 7
In these tests we pass various curated inputs into the system and verify that all the recommendations that come out satisfy some criteria specified in those inputs. When we ask the system for low-cost restaurants, none of the restaurants should be pricey. When we ask the system for recommendations for restaurants in DC, all the recommendations should… be located in DC!
Evals need manually labeled data
This might seem obvious, but building your domain-specific functional evals is going to require manually defining “good” and “bad” responses from the AI for given inputs. Put another way, you’re likely going to need to manually label some data, much as you would if you were building a training dataset for supervised learning. You will want to lean on the subject-matter experts you have within your organization to do this labeling - folks in customer success, sales engineers, account managers, product managers, whoever is best positioned to define what good and bad look like in terms of AI output.
Some version of this labeled data could also end up being used for in-context learning (examples within an n-shot prompt), or even for fine-tuning, but the initial focus should be gathering a solid validation dataset for evals.
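To give a sense of what this labeled data might look like, here’s a sketch of one possible record format (all field names are illustrative, not prescriptive). Records like these can drive the curated input lists used in the parametrized evals above:

# A sketch of one possible labeled-example format. The point is to capture the
# input plus whatever a subject-matter expert decided a good response must (or
# must not) contain; the field names here are purely illustrative.
labeled_examples = [
    {
        "query": "cheap vegan lunch near the National Mall",
        "must_have_tags": ["vegan"],
        "must_not_have_price": "$$$$",
        "labeled_by": "customer-success",
    },
    {
        "query": "kid-friendly brunch spots in DC this weekend",
        "must_have_location": "DC",
        "labeled_by": "account-manager",
    },
]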
You can start building this labeled dataset using simple tools - a shared Google Sheet or similar - but often it becomes worthwhile to build some simple internal tooling to streamline this labeling process¹. Low-code tools like Retool, Airtable, Bubble, or Streamlit are a good fit for this.
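As an illustration of how little code such a tool needs, here’s a rough Streamlit sketch - the file paths and record format are assumptions, so treat it as a starting point rather than a finished tool:

# A minimal sketch of an internal labeling tool in Streamlit. The file paths and
# record format are assumptions - adapt them to whatever your pipeline produces.
import json

import streamlit as st

EXAMPLES_PATH = "unlabeled_outputs.jsonl"  # one {"query": ..., "response": ...} per line
LABELS_PATH = "labels.jsonl"

# Load the AI outputs that still need a human judgement.
with open(EXAMPLES_PATH) as f:
    examples = [json.loads(line) for line in f]

# Remember which example the labeler is currently looking at.
if "idx" not in st.session_state:
    st.session_state.idx = 0

if st.session_state.idx >= len(examples):
    st.write("All examples have been labeled - nice work!")
    st.stop()

example = examples[st.session_state.idx]

st.title("AI output labeling")
st.subheader("Query")
st.write(example["query"])
st.subheader("AI response")
st.write(example["response"])

label = st.radio("Is this a good response?", ["good", "bad"])
notes = st.text_area("Notes (optional)")

if st.button("Save and next"):
    with open(LABELS_PATH, "a") as f:
        f.write(json.dumps({**example, "label": label, "notes": notes}) + "\n")
    st.session_state.idx += 1
    st.rerun()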
Spend more time with your production systems
In the next part of this series I’ll talk about how uniquely valuable your production systems are when it comes to AI features. We’ll look at what you should be monitoring (both manually and automatically), how to systematically incorporate user feedback, and how to harvest labeled data from production examples.
¹ Hamel Husain has some nice in-depth guidance on building internal data annotation tools for this.