Case Study · AI Response Quality

Found that most low-scoring AI responses were incomplete, not incorrect

Scale: 7 product teams · 949 responses scored

Scoring model: Correctness, Completeness, Clarity

Context: Internal RAG system, customer support

I designed and ran a shared evaluation system across seven product teams to determine whether poor AI responses stemmed from source content, retrieval, or system behavior, and to help each team choose the right fix.

The problem

Teams had no reliable way to judge AI responses. Some answers were correct but incomplete, while others sounded convincing but contained small errors. Reviewers often scored the same answer differently.

Without shared criteria, teams could not compare results, find recurring failure patterns, or tell whether a problem came from content, retrieval, or system behavior.

The approach

I designed and ran the evaluation system: I recruited seven product teams, created the 3C scoring model (Correctness, Completeness, Clarity), trained evaluators, coordinated participation, and reported findings to the system architect.

The system combined centralized Q&A datasets, standardized scoring, and repeated test cycles. Because the organization controlled the content, the retrieval layer, and the response system, each cycle was designed so that findings could feed directly back into tuning all three.

1. Build test sets: Real questions and reference answers, organized by product team.

2. Score responses: Standardized 3C model applied consistently across all 949 responses.

3. Compare test cycles: Repeated test rounds make performance changes visible over time.

4. Find patterns: Patterns identified across teams, question types, and system behavior.

5. Apply the findings: Results inform content remediation, retrieval tuning, and system improvements.
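The scoring and cycle-comparison steps can be sketched in a few lines of Python. This is an illustration only, not the tooling used in the program: the `ScoredResponse` record, the 1-5 scale, the unweighted mean as the overall score, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ScoredResponse:
    """One evaluated response (hypothetical record layout)."""
    team: str          # product team that owns the question
    cycle: int         # test cycle number, 1 = first round
    question_id: str
    correctness: int   # assumed 1-5 scale for all three criteria
    completeness: int
    clarity: int

    @property
    def overall(self) -> float:
        # Assumption: overall score is the unweighted mean of the 3C criteria.
        return mean((self.correctness, self.completeness, self.clarity))


def mean_by_team_and_cycle(responses):
    """Return {(team, cycle): mean overall score, rounded to 2 places}."""
    buckets = defaultdict(list)
    for r in responses:
        buckets[(r.team, r.cycle)].append(r.overall)
    return {key: round(mean(vals), 2) for key, vals in buckets.items()}


def cycle_deltas(responses, earlier, later):
    """Per-team change in mean score between two cycles."""
    means = mean_by_team_and_cycle(responses)
    teams = {team for team, _ in means}
    return {
        team: round(means[(team, later)] - means[(team, earlier)], 2)
        for team in teams
        if (team, earlier) in means and (team, later) in means
    }


if __name__ == "__main__":
    # Invented sample data: the same two questions scored in two cycles.
    sample = [
        ScoredResponse("billing", 1, "q1", 5, 3, 4),
        ScoredResponse("billing", 1, "q2", 4, 2, 3),
        ScoredResponse("billing", 2, "q1", 5, 4, 4),
        ScoredResponse("billing", 2, "q2", 4, 4, 4),
    ]
    print(cycle_deltas(sample, earlier=1, later=2))  # {'billing': 0.67}
```

Keying every score by team, cycle, and question is what makes the later steps possible: the same records can be regrouped by question type or by criterion without rescoring anything.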

What the data showed

The evaluation separated low-scoring responses by failure type: completeness failures versus correctness failures. Across 949 responses from seven teams, the average score was 3.63, with 35% falling below the target threshold.

Key finding: 71% of low-scoring responses were incomplete, not incorrect.

The dominant failure mode was missing context: responses were factually accurate but lacked the adjacent information a user needed to act. The data helped teams trace each failure to likely problems in content, retrieval, or system behavior.
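A breakdown like this one can be computed by labeling each low-scoring response with its weakest criterion. The sketch below shows one way to do that. The `TARGET` value, the 1-5 scale, the unweighted overall score, and the tie-breaking rule are assumptions chosen for illustration, not the thresholds or rules used in the actual evaluation.

```python
from collections import Counter

TARGET = 4.0  # hypothetical target threshold on an assumed 1-5 scale

# Order matters: on a tie, the earliest criterion wins, so a correctness
# problem is never masked by an equally low completeness or clarity score.
CRITERIA = ("correctness", "completeness", "clarity")


def overall(score):
    # Assumption: overall score is the unweighted mean of the three criteria.
    return sum(score[c] for c in CRITERIA) / len(CRITERIA)


def failure_type(score):
    """Return the weakest criterion for one scored response."""
    # min() returns the first minimal item, which gives the tie-break above.
    return min(CRITERIA, key=lambda c: score[c])


def failure_breakdown(scores):
    """Share of responses below target, plus failure-type counts among them."""
    low = [s for s in scores if overall(s) < TARGET]
    counts = Counter(failure_type(s) for s in low)
    return {
        "below_target_share": len(low) / len(scores) if scores else 0.0,
        "failure_counts": dict(counts),
    }
```

Treating the weakest criterion as the failure type is a simplification. A reviewer-assigned failure label, if one is collected, would be a more direct signal than one inferred from the scores.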

Content

Coverage gaps led to incomplete responses
Over-chunked content reduced available context
Weak structure made key information harder to retrieve

Retrieval

Index gaps prevented high-quality content from surfacing
Ranking returned partial or lower-quality matches
Relevant content was harder to retrieve when product context appeared only in metadata

System

Responses varied across identical inputs
Failures clustered around specific question types
Teams lacked consistent evaluation data for system tuning

How the findings were used

Each product team received a scorecard showing specific content, retrieval, and system problems. Content teams received prioritized fixes based on question-level failures. Retrieval problems were traced to indexing, ranking, and coverage gaps. Engineers received structured findings instead of anecdotal feedback.

The framework showed each team which problems it could address and where failures were clustering.

A second cycle was planned but not executed before the program context changed. The reusable output was the evaluation framework itself: shared scoring criteria, test-cycle structure, and reporting patterns that teams could apply to future AI response reviews.