Case Study · AI Response Quality
Why most low-scoring AI responses were incomplete, not incorrect
A shared evaluation system designed and run across seven product teams. The goal was to determine whether poor AI output came from content gaps, retrieval failures, or system behavior — and give each team a way to act on that distinction.
The problem
Teams weren't just unsure whether answers were correct. They had no reliable way to evaluate them.
Responses varied in subtle but important ways. Some were technically correct but incomplete — missing the context a user needed to act. Others sounded confident but contained small inaccuracies. Different reviewers often reached different conclusions about the same answer.
Without a shared evaluation model, feedback stayed subjective. Teams couldn't compare results, identify patterns, or determine where the system was actually failing. Issues persisted without a clear path to improvement.
The approach
I designed and ran the evaluation system end-to-end: recruiting seven product teams, building the 3C scoring model (Correctness, Completeness, Clarity), training evaluators, coordinating participation, and reporting findings directly to the system architect.
The system combined centralized Q&A datasets, standardized scoring, and repeated test cycles. Because the organization owned the full RAG stack, each cycle was designed so findings could feed directly back into content, retrieval, and system tuning.
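As an illustration of what a standardized 3C score could look like as data, here is a minimal sketch. The field names, the 1 to 5 scale, and the unweighted average are assumptions made for the example; the case study does not specify the actual scale or weighting.

```python
from dataclasses import dataclass

# Assumed rating scale; the case study does not state the real range.
SCALE_MIN, SCALE_MAX = 1, 5


@dataclass(frozen=True)
class ThreeCScore:
    """One evaluator's rating of one AI response (hypothetical schema)."""

    team: str
    question_id: str
    correctness: int
    completeness: int
    clarity: int

    def __post_init__(self) -> None:
        # Reject ratings outside the assumed scale so every team scores alike.
        for name in ("correctness", "completeness", "clarity"):
            value = getattr(self, name)
            if not SCALE_MIN <= value <= SCALE_MAX:
                raise ValueError(
                    f"{name} must be between {SCALE_MIN} and {SCALE_MAX}, got {value}"
                )

    @property
    def overall(self) -> float:
        # Unweighted mean is an assumption; a real model might weight dimensions.
        return (self.correctness + self.completeness + self.clarity) / 3


# Hypothetical example: correct and clear, but missing context.
example = ThreeCScore("team-a", "q-001", correctness=5, completeness=2, clarity=4)
print(round(example.overall, 2))
```

Keeping the three dimensions separate, instead of storing only the overall number, is what allows an incomplete answer to be told apart from an incorrect one later.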
What the data showed
The evaluation separated low-scoring responses by failure type — completeness failures versus accuracy failures. Across 949 responses from seven teams, the average score was 3.63, with 35% falling below the target threshold.
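The split between completeness failures and accuracy failures can be sketched as follows. This is an illustrative reconstruction, not the program's actual logic: the threshold value of 3.5, the rule of comparing the correctness and completeness ratings, and the sample data are all hypothetical.

```python
from collections import Counter
from statistics import mean

THRESHOLD = 3.5  # hypothetical; the source does not state the target threshold


def failure_type(correctness: int, completeness: int) -> str:
    """Label a low-scoring response by its weaker dimension (assumed rule)."""
    if completeness < correctness:
        return "completeness"
    if correctness < completeness:
        return "accuracy"
    return "mixed"


def summarize(scores, threshold: float = THRESHOLD) -> dict:
    """scores: iterable of (correctness, completeness, clarity) tuples."""
    rows = list(scores)
    overall = [sum(row) / 3 for row in rows]
    # Only responses below the threshold are classified by failure type.
    low = [row for row, value in zip(rows, overall) if value < threshold]
    return {
        "average": mean(overall),
        "share_below": len(low) / len(rows),
        "failure_types": Counter(failure_type(c, k) for c, k, _ in low),
    }


# Made-up sample ratings, not data from the evaluation.
sample = [(5, 2, 3), (2, 4, 4), (5, 5, 5), (4, 2, 4)]
summary = summarize(sample)
print(summary)
```

The same two summary numbers reported above, an average score and a share below threshold, fall out of `summarize`, with the failure-type counts alongside them.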
The dominant failure mode was missing context: responses that were factually accurate but lacked the adjacent information a user needed to act. The data made visible which layer of the knowledge system was responsible for each failure type.
Content
- Coverage gaps led to incomplete responses
- Over-chunked content reduced available context
- Weak structure made key information harder to retrieve
Retrieval
- Index gaps prevented high-quality content from surfacing
- Ranking returned partial or lower-quality matches
- SEO gaps limited discoverability of relevant content
System
- Responses varied across identical inputs
- Failures clustered around specific question types
- Tuning lacked consistent, structured evaluation signals
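One way to quantify the first system finding, variation across identical inputs, is to score repeated runs of the same question and measure the spread. The sketch below assumes each run is recorded as a question ID paired with an overall score; the record shape, the use of population standard deviation, and the 0.5 cutoff are hypothetical choices for the example.

```python
from collections import defaultdict
from statistics import pstdev


def score_spread(runs) -> dict:
    """Return the standard deviation of scores for each repeated question.

    runs: iterable of (question_id, overall_score) pairs.
    """
    by_question = defaultdict(list)
    for question_id, score in runs:
        by_question[question_id].append(score)
    # A spread is only meaningful when a question was asked more than once.
    return {q: pstdev(s) for q, s in by_question.items() if len(s) > 1}


# Made-up repeated runs, not data from the evaluation.
runs = [("q-001", 4.0), ("q-001", 4.0), ("q-002", 2.0), ("q-002", 5.0)]
spread = score_spread(runs)

# Flag questions whose scores swing by more than the assumed cutoff.
unstable = sorted(q for q, sd in spread.items() if sd > 0.5)
print(unstable)
```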
How the findings were used
Per-team scorecards named specific gaps, not just overall scores. Content teams received prioritized improvement lists based on question-level failure patterns. Retrieval issues were traced to specific breakdowns in indexing, ranking, and content coverage. Engineering teams received structured findings they could act on — not anecdotal feedback.
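A per-team scorecard of this kind can be sketched as a simple grouping step. The record fields, the threshold of 3.5, and the idea of ranking priority questions by lowest average score are assumptions for illustration; the actual scorecard format is not described in detail here.

```python
from collections import Counter, defaultdict
from statistics import mean


def build_scorecards(records, threshold: float = 3.5, top_n: int = 3) -> dict:
    """Group scored responses by team and summarize where failures cluster."""
    by_team = defaultdict(list)
    for record in records:
        by_team[record["team"]].append(record)

    cards = {}
    for team, rows in by_team.items():
        low = [r for r in rows if r["overall"] < threshold]
        per_question = defaultdict(list)
        for r in low:
            per_question[r["question_id"]].append(r["overall"])
        # Lowest-scoring questions come first in the improvement list.
        priority = sorted(per_question, key=lambda q: mean(per_question[q]))[:top_n]
        cards[team] = {
            "responses": len(rows),
            "below_threshold": len(low),
            "failure_types": dict(Counter(r["failure_type"] for r in low)),
            "priority_questions": priority,
        }
    return cards


# Made-up records, not data from the evaluation.
records = [
    {"team": "team-a", "question_id": "q-001", "overall": 2.0, "failure_type": "completeness"},
    {"team": "team-a", "question_id": "q-002", "overall": 3.0, "failure_type": "completeness"},
    {"team": "team-a", "question_id": "q-003", "overall": 4.5, "failure_type": "none"},
    {"team": "team-b", "question_id": "q-001", "overall": 2.5, "failure_type": "accuracy"},
]
cards = build_scorecards(records)
print(cards["team-a"])
```

Reporting failure types and priority questions next to the counts is what lets a scorecard name specific gaps instead of a single overall score.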
The evaluation gave each team a clear picture of which levers they controlled and where their failures were clustering. For many, it was the first time that picture existed.
A second cycle was planned but not executed before the program context changed. The reusable output was the evaluation framework itself: shared scoring criteria, test-cycle structure, and reporting patterns that teams could apply to future AI response reviews.