Stop Building RAG Systems That Hallucinate: 3 Metrics That Will Show You What Is Broken
July 27, 2024
"We ask the same question 10 times and get 10 different answers. Our RAG system is completely unreliable."
This was the first thing a frustrated tech lead told me when they reached out for help. Their team had spent months building what they thought was a sophisticated RAG system, feeding it massive amounts of context, convinced that more data would solve their accuracy problems. But here's what they didn't realize: they were measuring everything except what actually mattered.
When I asked them about their baseline metrics, silence. When I asked how they evaluated retrieval quality, more silence. They had four custom metrics focused entirely on the final output—relevancy, correctness, tonality—but zero visibility into whether their system was even finding the right information in the first place.
Here's the uncomfortable truth most AI teams refuse to acknowledge: RAG isn't a generation problem, it's a retrieval problem. You can have the most sophisticated LLM in the world, but if you're feeding it irrelevant chunks (like this client's 20 random chunks per query), you're essentially asking it to hallucinate with confidence.
Treat AI as the algorithm it is: measure to improve
In this guide, you'll discover the 3 most critical metrics that reveal what's actually happening behind the scenes in your RAG pipeline—and the target scores you should aim for to eliminate hallucination. After helping over 10 engineering teams put RAG into production (from startups to stock-exchange-listed enterprises), I've learned that most software engineers excel at building systems but struggle with AI quality evaluation, a skill that comes more naturally to people with ML and data science backgrounds.
If you know what to measure, you can finally see what's broken, make data-driven improvements, and boost your system's reliability. The client I mentioned? I took their correctness score from 0.45 to 0.8 by focusing on these metrics instead of their original four. This isn't about code implementation—it's strategic guidance on measurement that will transform how you approach RAG evaluation and give you the visibility you need to make your users (and stakeholders) happy.
The solution isn't more context or better prompts—it's treating RAG like the ML system it actually is, with proper evaluation, iteration, and metrics that matter. If you don't start measuring where you are, you will never know what actually improves the system.
Why measuring retrieval is the game-changer most teams ignore
Here's what I see in 8 out of 10 RAG implementations: teams obsess over prompt engineering and model selection while completely ignoring whether their system is even finding the right information. They measure the final output—correctness, relevance or tone—but have zero visibility into the retrieval pipeline that feeds their LLM.
This is backwards thinking. With today's LLMs (GPT-4, Claude, Gemini), generation quality is rarely the bottleneck. These models are remarkably good at synthesizing information when given the right context. The problem isn't that your LLM can't write a good answer—it's that you're not giving it the right information to work with.
Think of it this way: if you ask a brilliant consultant to analyze a business problem but only give them irrelevant documents, even their expertise can't save the output. That's exactly what happens when your retrieval system returns 20 random chunks and expects the LLM to magically find the needle in the haystack.
When you measure retrieval separately from generation, you gain surgical precision in diagnosing problems. If your overall correctness is 50%, you need to know: is this because the right information isn't being retrieved (retrieval problem) or because the LLM isn't synthesizing it properly (generation problem)? Without this separation, you're flying blind.
The revelation for most teams is discovering that their "AI problem" is actually a search problem—and search problems have well-established solutions.
The 3 metrics that reveal everything (and the scores you need)
After analyzing dozens of RAG implementations, I've distilled RAG evaluation down to only three essential metrics that give you complete visibility into your system's performance:
1. Retrieval Precision: Quality of what you find
What it measures: Of all the chunks your system retrieves for a question, how many actually contain information relevant to answering it?
Formula: Precision = Retrieved Relevant Chunks / Total Retrieved Chunks
Why it matters: Low precision means you're cluttering your LLM's context with noise. When you retrieve 20 chunks but only 3 are relevant, you're essentially asking your LLM to find signal in a sea of noise—a recipe for hallucination.
Target scores:
- Good range: 0.50 - 0.70
- Great range: 0.70+
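To make the formula concrete, here's a minimal Python sketch, assuming you've labeled which chunk IDs are relevant for each test question (the chunk-ID bookkeeping is your own; nothing here is tied to a specific framework):

```python
def retrieval_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that actually contain relevant information."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)
```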
2. Retrieval Recall: Coverage of what exists
What it measures: Of all the relevant chunks that exist in your knowledge base for a given question, how many did your system successfully retrieve?
Formula: Recall = Retrieved Relevant Chunks / Total Relevant Chunks
Why it matters: This is your most critical metric. High recall ensures that when relevant information exists, your system finds it. Missing key information guarantees wrong answers, no matter how good your LLM is.
Target scores:
- Good range: 0.65 - 0.80
- Great range: 0.80+
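The recall calculation mirrors the precision sketch above, again assuming you've labeled the relevant chunk IDs per question in your test set:

```python
def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of all relevant chunks in the knowledge base that were retrieved."""
    if not relevant_ids:
        return 1.0  # nothing relevant exists for this question, so nothing was missed
    hits = sum(1 for chunk_id in set(retrieved_ids) if chunk_id in relevant_ids)
    return hits / len(relevant_ids)
```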
3. Generation Correctness: Final answer quality
What it measures: How often your system provides factually correct and complete answers to user questions. You could see this as the accuracy of the system. I use the term 'correctness' because it can be measured with a simple yes-or-no judgment: "Is this answer correct for the query?"
Why it matters: This is your north star metric, what users actually care about. But measuring it in isolation tells you nothing about where to improve.
How to measure it: Ideally, you measure correctness with real user data. A simple "Was this answer correct?" prompt with a thumbs up/down button will get you all the data you need. Most of you reading this, however, have a RAG system that is still in a testing phase, which means you probably don't have access to real user data yet. To mitigate this, I advise my clients to create a synthetic question-answer dataset of at least 100 questions. Yes, it takes some manual effort, but it gives you all the data you need to look behind the scenes. Per experiment, you can then measure correctness with an LLM-as-a-judge (a minimal sketch follows after the target scores below) or by manually evaluating the 100 questions. If you have the subject expertise, I recommend the latter since it is the most accurate. It takes more time, of course, but it shouldn't be more than 30-60 minutes per experiment. A time investment that is highly worth it.
Target scores:
- Good range: 0.70 - 0.85
- Great range: 0.85+
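If you go the LLM-as-a-judge route, here's a minimal sketch using the OpenAI Python SDK; the judge prompt and model choice are just assumptions, so swap in whatever client and model you actually use:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

JUDGE_PROMPT = """You are grading a RAG system.
Question: {question}
Reference answer: {reference}
System answer: {answer}
Is the system answer factually correct and complete compared to the reference?
Reply with exactly one word: yes or no."""

def judge_correctness(question: str, reference: str, answer: str) -> bool:
    """Return a binary correct/incorrect verdict from a judge model."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable judge model works here
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```

Correctness over your 100-question dataset is then simply the fraction of "yes" verdicts.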
What to do with this information
Now that we've run the initial evaluation, we can start analyzing what the scores actually mean. The magic is in tracking all 3 of them together. Let's go over some examples:
Low correctness
As in most faulty RAG systems, your correctness will probably be low; otherwise there is no problem to begin with. However, instead of going straight into prompt tuning, we can now look at the retrieval metrics. If one or both of them are also low, we know there is a retrieval problem. Fix that, and correctness will drastically improve. In the small chance that it doesn't, you have a generation problem. Only then do I finally allow you to tune your prompts.
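To make that decision explicit, here's a rough triage helper. It's purely illustrative; the thresholds simply mirror the "good" ranges listed above:

```python
def diagnose(correctness: float, precision: float, recall: float) -> str:
    """Rough triage based on the three metrics measured over your test set."""
    if correctness >= 0.70:
        return "Healthy: move on to secondary measures like tonality."
    if recall < 0.65 or precision < 0.50:
        return "Retrieval problem: fix chunking, top_k or search strategy first."
    return "Generation problem: retrieval looks fine, so now tune prompts or the model."
```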
Low precision
If you find a low precision score, it means you're feeding the LLM a lot of noise. With more advanced LLMs this isn't necessarily a problem, but a less capable model like gpt-3.5-turbo or gpt-4o-mini will definitely struggle to answer.
You can test the system with a better model to see if correctness improves, or you can fix your precision. The easiest way to do that is to give the LLM less clutter, e.g. lower your top_k parameter, which tells the retrieval step how many chunks to pass to the LLM. If your top_k is very high, you're almost certainly passing the LLM irrelevant information as well.
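A simple way to test this is to sweep top_k over your evaluation set and watch how average precision moves. The sketch below assumes a `retrieve(question, top_k)` stand-in for your own retrieval step and reuses the `retrieval_precision` helper from earlier:

```python
def sweep_top_k(eval_set, retrieve, top_k_values=(20, 10, 5, 3)):
    """Measure average retrieval precision at several top_k settings.

    `retrieve(question, top_k)` should return a list of chunk IDs; each
    eval_set item carries the labeled relevant IDs from your synthetic dataset.
    """
    for top_k in top_k_values:
        scores = [
            retrieval_precision(retrieve(item["question"], top_k), item["relevant_ids"])
            for item in eval_set
        ]
        print(f"top_k={top_k}: avg precision={sum(scores) / len(scores):.2f}")
```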
Low recall
Lastly, if you find a low recall score, it means the LLM doesn't even have the relevant chunks to answer the query to begin with. This is a massive problem: without the correct information it is impossible to answer the question. The LLM will guess, the answer will be wrong, and the user will be unhappy. One big mess.
This metric is a bit harder to improve and will require more testing to get it right. Some examples include:
- increase your top_k
- utilize metadata/query understanding
- hybrid search (a combination of full-text search and vector search; a minimal fusion sketch follows below). I will release a more in-depth guide on this soon.
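For the hybrid-search option, a common way to merge full-text and vector results is reciprocal rank fusion. This is a generic sketch, not tied to any particular search engine:

```python
def reciprocal_rank_fusion(vector_hits: list[str], keyword_hits: list[str], k: int = 60) -> list[str]:
    """Merge two ranked lists of chunk IDs into one combined ranking (RRF)."""
    scores: dict[str, float] = {}
    for ranking in (vector_hits, keyword_hits):
        for rank, chunk_id in enumerate(ranking):
            # Chunks that rank highly in either list accumulate a higher fused score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```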
Three common mistakes that are killing your RAG performance
Mistake #1: Measuring only the final output
Most teams create elaborate scoring systems with multiple dimensions—relevance, correctness, tone, completeness—all focused on the final answer. This is like judging a restaurant only by the final dish while ignoring whether the chef received the right ingredients.
I recently worked with a team that had four custom metrics and spent weeks debating scoring criteria. When we simplified to just correctness and added retrieval metrics, they identified their core problem in two days: their chunking strategy was off, splitting critical information across multiple chunks, which combined with a low top_k destroyed recall. Of course we also made some more advanced improvements, but the simple fixes usually yield the highest initial gain.
Mistake #2: No baseline measurement
"How do you know if your changes are working?" I ask this question in every RAG consultation, and the answer is usually silence or vague statements about "feeling better."
You can't improve what you don't measure. Before making any optimizations, establish baseline scores for all three metrics using a representative dataset of 100+ questions. This becomes your foundation for systematic improvement.
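A baseline run can be as simple as looping over that dataset once and writing the averages to disk. The sketch below assumes a hypothetical `pipeline.answer()` wrapper around your own system and reuses the metric helpers from earlier in this post:

```python
import json

def baseline_run(eval_set, pipeline, path="baseline.json") -> dict:
    """Record baseline precision, recall and correctness before changing anything."""
    rows = []
    for item in eval_set:
        # pipeline.answer() is a stand-in: it should return the retrieved chunk IDs
        # and the generated answer for a question.
        retrieved_ids, answer = pipeline.answer(item["question"])
        rows.append({
            "precision": retrieval_precision(retrieved_ids, item["relevant_ids"]),
            "recall": retrieval_recall(retrieved_ids, item["relevant_ids"]),
            "correct": judge_correctness(item["question"], item["reference"], answer),
        })
    n = len(rows)
    baseline = {
        "precision": sum(r["precision"] for r in rows) / n,
        "recall": sum(r["recall"] for r in rows) / n,
        "correctness": sum(r["correct"] for r in rows) / n,
    }
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline
```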
Mistake #3: Overcomplicating evaluation
Teams often create complex evaluation frameworks with subjective scoring, multiple reviewers, and elaborate rubrics. This slows down testing iteration and makes it harder to identify clear improvement signals.
The most successful RAG implementations I've seen use simple, objective metrics that can be calculated automatically. I recommend focusing on correctness as your primary metric, measuring retrieval separately, and iterating quickly based on clear signals. Once you've nailed correctness, you can move on to more subjective measures like tonality.
Remember: the goal isn't perfect evaluation—it's actionable insights that drive systematic improvement. A simple metric you can calculate daily beats a perfect metric you can only run monthly.
Ready to fix your RAG system?
Now you know what to measure, but implementing these metrics and interpreting the results is where most teams get stuck. The difference between knowing these concepts and successfully applying them to your specific system is often the difference between a struggling RAG implementation and one that consistently delivers reliable results.
If you found this helpful, I share more insights like this in my newsletter—practical frameworks for building reliable AI systems, lessons learned from real implementations, and the occasional deep dive into what's actually working in production.
And if you're dealing with a specific RAG or AI challenge right now, I offer limited 30-minute consults to help teams figure out what's going wrong and what to focus on first. I only have 4 spots per week, so it's first come, first served. Sometimes an outside perspective can spot the issues that are hard to see when you're deep in the code.