How to actually measure LLM accuracy in production
Beyond vibes: a working evaluation framework for LLM systems in production. The specific metrics, golden datasets, and cadence that catch regressions before users do.
Most teams evaluate their LLM systems by "does it seem to work?" That holds up until a model update, a retrieval tweak, or a prompt change silently breaks 15% of outputs and no one notices until customers complain.
Here is the evaluation framework we use. It is boring, cheap, and works.
Step 1: Build a golden dataset (100-500 examples)
A golden dataset is a collection of realistic inputs paired with correct outputs (or scoring rubrics for open-ended outputs). For a typical document-Q&A system:
- Collect 100-500 real user questions (or synthesize realistic ones if you’re pre-launch).
- For each, annotate: the correct answer, the expected citation, whether the system should refuse.
- Include 10-20% adversarial or edge cases (ambiguous questions, queries about topics not in your corpus, prompt injection attempts).
- Store as JSONL. Version-control it.
This takes 2-3 days of one person’s time. It is the highest-leverage 2-3 days in an LLM project.
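As a concrete sketch, a golden-dataset entry in JSONL might look like the following. The field names here are one possible schema, not a standard — adapt them to your product:

```python
import json

# Two illustrative golden-dataset entries (field names are our own choice):
# a normal question with a gold answer and citation, and an adversarial
# case where the correct behavior is to refuse.
GOLDEN_JSONL = """\
{"id": "q-001", "question": "What is the refund window for annual plans?", "gold_answer": "30 days from purchase.", "expected_citation": "refund-policy.md", "should_refuse": false, "category": "billing"}
{"id": "q-002", "question": "What is our CEO's home address?", "gold_answer": null, "expected_citation": null, "should_refuse": true, "category": "adversarial"}
"""

def load_golden(jsonl_text):
    """Parse one JSON object per non-empty line of a JSONL string."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

golden = load_golden(GOLDEN_JSONL)
```

Because the file is line-delimited JSON under version control, diffs stay readable and each example can be reviewed independently in a pull request.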
Step 2: Define metrics that match the product
Generic "accuracy" is useless. Pick metrics that match user-visible outcomes:
- Answer correctness — semantic match or rubric score against gold answer
- Citation validity — does the cited source actually support the answer?
- Refusal rate — does it correctly refuse when data isn’t in the corpus?
- Hallucination rate — answers that claim facts not in retrieved context
- Latency (p50, p95) — if this drifts, users notice before accuracy regressions
- Retrieval recall@k — for RAG systems, is the right document even in the top-k?
Pick three to five. Track them over time. Alert on deviations.
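Two of these metrics are mechanical enough to show in a few lines. Here is a minimal sketch of retrieval recall@k and refusal accuracy, assuming each eval record carries the `should_refuse` flag from the golden dataset plus a `refused` flag recorded at run time (both names are illustrative):

```python
def recall_at_k(retrieved_ids, gold_id, k=5):
    """1.0 if the gold document appears in the top-k retrieved ids, else 0.0."""
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0

def refusal_accuracy(records):
    """Fraction of should-refuse cases where the system actually refused.

    Each record is a dict with boolean "should_refuse" and "refused" keys.
    Returns None when the golden set has no should-refuse cases.
    """
    cases = [r for r in records if r["should_refuse"]]
    if not cases:
        return None
    return sum(r["refused"] for r in cases) / len(cases)
```

Averaging `recall_at_k` over the golden set gives the recall@k number to track nightly.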
Step 3: Grade with a judge (LLM or human)
For closed-ended answers, string match or regex works. For open-ended answers, you have two choices:
- Human grading. Accurate, slow, expensive. Good for the first 100 examples to build trust in the framework.
- LLM-as-judge. Fast, cheap, imperfect. Use a strong model (e.g., GPT-4 or Claude) with a detailed rubric to grade the outputs of your production model.
In practice: grade 50 examples by hand to build a rubric. Then have an LLM grade all 500 using that rubric, with spot-checks from a human on 10% weekly.
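The plumbing around an LLM judge is mostly prompt construction and defensive parsing. A sketch of both, with a deliberately simple 1-5 rubric (the rubric text and function names are ours; the call to your judge model is whatever client your provider gives you):

```python
# An example rubric; in practice this is distilled from your 50 hand-graded
# examples and is much more specific to the product.
RUBRIC = """Score the ANSWER against the GOLD answer on a 1-5 scale:
5 = factually equivalent, fully supported by the cited source
3 = partially correct or missing key details
1 = wrong, unsupported, or hallucinated
Reply with a single integer."""

def judge_prompt(question, answer, gold):
    """Assemble the grading prompt sent to the judge model."""
    return f"{RUBRIC}\n\nQUESTION: {question}\nANSWER: {answer}\nGOLD: {gold}"

def parse_judge_score(reply, min_score=1, max_score=5):
    """Extract the first in-range integer from the judge's reply.

    Judges occasionally reply with prose; returning None instead of
    guessing lets you count unparseable grades separately.
    """
    for token in reply.split():
        stripped = token.strip(".")
        if stripped.isdigit() and min_score <= int(stripped) <= max_score:
            return int(stripped)
    return None
```

Always parse defensively: a judge that returns "Score: 4." on Monday may return a paragraph on Tuesday, and silent parse failures look like score regressions.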
Step 4: Run it nightly, not just before releases
"We run evals before every deploy" is a trap if that is the only time they run. By the time you catch a regression, the pipeline change is 3 days old and the engineer has context-switched. You want evals running every night automatically, with alerts on deltas beyond 2-3%.
Stand up the eval as a cron job or CI scheduled run. Post results to Slack or email. Treat sustained drops as incidents.
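The alerting logic itself is tiny. A sketch of the nightly delta check, assuming metrics are stored as dicts of metric name to score (the 3% threshold matches the 2-3% band above):

```python
def check_regression(baseline, current, threshold=0.03):
    """Return (metric, delta) pairs that dropped more than threshold.

    baseline and current map metric names to scores in [0, 1].
    Only drops alert; improvements pass silently.
    """
    alerts = []
    for metric, base_val in baseline.items():
        delta = current.get(metric, 0.0) - base_val
        if delta < -threshold:
            alerts.append((metric, round(delta, 4)))
    return alerts

baseline = {"correctness": 0.91, "citation_validity": 0.96}
tonight = {"correctness": 0.86, "citation_validity": 0.96}
alerts = check_regression(baseline, tonight)
```

Whatever fires this (cron, CI schedule) just needs to persist last night's scores and post `alerts` to Slack when the list is non-empty.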
Step 5: Separate offline evals from online metrics
Offline evals (your golden dataset) catch regressions. Online metrics (user behavior) reveal whether the system is actually valuable:
- User thumbs-up / thumbs-down rates
- Follow-up question rates (high = the first answer didn’t satisfy)
- Rate of copy-paste from output (users finding it useful)
- Time-to-next-action
Online metrics are noisier and harder to attribute. Offline metrics give you fast signal on changes. You need both.
Common evaluation mistakes
- Too small a dataset. 30 examples is vibes, not evaluation. Aim for 100+ minimum.
- Test set contamination. If your golden questions leaked into your prompt engineering, you’re grading yourself on training data. Hold out 20% for a blind split.
- Single-score aggregates. An 82% overall score hides the fact that you’re failing on medical-dosage questions. Report metrics by category, not just in aggregate.
- Skipping the "should refuse" category. The system needs to know when not to answer, and this category accounts for half of your production failures.
- Judging correctness but not citation. A correct answer with a wrong or missing citation is a compliance failure in regulated contexts.
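The per-category reporting mentioned above is a one-liner worth standardizing. A sketch, assuming each graded result carries the `category` field from the golden dataset and a boolean `correct` verdict:

```python
from collections import defaultdict

def scores_by_category(results):
    """Per-category correctness rates from a list of graded results.

    Each result is a dict with a "category" string and a boolean "correct".
    """
    totals = defaultdict(lambda: [0, 0])  # category -> [num_correct, num_total]
    for r in results:
        totals[r["category"]][0] += int(r["correct"])
        totals[r["category"]][1] += 1
    return {cat: correct / total for cat, (correct, total) in totals.items()}
```

Reporting this dict instead of a single aggregate is what surfaces the failing medical-dosage slice hiding inside an 82% overall score.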
What "good enough" looks like
For a policy-Q&A or document-retrieval system in production we typically target:
- 90%+ answer correctness on the golden set
- 95%+ citation validity
- <5% hallucination rate on out-of-scope questions
- >90% correct refusal on malformed or adversarial inputs
- p95 latency under 3 seconds
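These targets can double as a release gate. A minimal sketch, with floors for the rates that must stay high and ceilings for the two that must stay low (the metric names are ours, matching the list above):

```python
# Floors: metrics that must be at or above these rates.
TARGETS = {
    "answer_correctness": 0.90,
    "citation_validity": 0.95,
    "correct_refusal": 0.90,
}
# Ceilings: metrics that must stay at or below these values.
MAX_HALLUCINATION_RATE = 0.05
MAX_P95_LATENCY_S = 3.0

def meets_targets(metrics):
    """True only if every floor is met and both ceilings are respected."""
    floors_ok = all(metrics[name] >= floor for name, floor in TARGETS.items())
    ceilings_ok = (metrics["hallucination_rate"] <= MAX_HALLUCINATION_RATE
                   and metrics["p95_latency_s"] <= MAX_P95_LATENCY_S)
    return floors_ok and ceilings_ok
```

Wiring this into the nightly run turns "good enough" from a judgment call into a boolean you can alert on.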
If you can’t measure these today, you can’t know whether tomorrow’s change improved or regressed the system.