Retrieval-augmented generation makes it easy to ship a demo and hard to ship a product. The demo works on the examples you tried; production surfaces the ones you did not.
Measure retrieval and generation separately
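When answers are wrong, the cause is often retrieval, not the model. Check whether the right context was retrieved before blaming generation: the two are different failures with different fixes. A missing document points at chunking, indexing, or ranking; a document that was retrieved but misused points at the prompt or the model.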
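One way to keep the two failures apart is to score retrieval on its own before looking at the answer. The sketch below is a minimal illustration, not a prescribed method: `EvalCase`, `recall_at_k`, and `diagnose` are hypothetical names, the document IDs are placeholders, and it assumes you have labeled which documents each question needs and judged answer correctness separately.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class EvalCase:
    """One labeled example (hypothetical structure for illustration)."""
    question: str
    relevant_ids: Set[str]    # documents a correct answer depends on
    retrieved_ids: List[str]  # what the retriever returned, best first
    answer_correct: bool      # judged separately (human or automated)


def recall_at_k(case: EvalCase, k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not case.relevant_ids:
        return 1.0  # nothing was required, so nothing was missed
    top_k = set(case.retrieved_ids[:k])
    return len(case.relevant_ids & top_k) / len(case.relevant_ids)


def diagnose(case: EvalCase, k: int = 5) -> str:
    """Attribute a wrong answer to retrieval or generation."""
    if case.answer_correct:
        return "ok"
    if recall_at_k(case, k) < 1.0:
        return "retrieval"   # required context never reached the model
    return "generation"      # context was there; the model misused it


cases = [
    EvalCase("q1", {"d1"}, ["d1", "d2"], True),
    EvalCase("q2", {"d3"}, ["d1", "d2"], False),
    EvalCase("q3", {"d4"}, ["d4", "d5"], False),
]
labels = [diagnose(c, k=2) for c in cases]
```

Counting the "retrieval" and "generation" labels across the whole set gives a rough picture of where to spend effort first.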
Build the eval set early
- Collect real questions and expected answers from day one.
- Automate scoring so every change is measured, not guessed.
- Track quality over time the way you track latency and cost.
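The three habits above can be sketched as a small harness. Everything here is an assumption made for illustration: the questions and expected answers are made-up placeholders, `answer_fn` stands in for whatever system you are testing, and the substring check in `score` is a deliberately crude scorer that a real project would likely replace with something stricter.

```python
import json
import time
from typing import Callable, Dict, List

# Placeholder eval set; in practice, collect these from real user questions.
EVAL_SET: List[Dict[str, str]] = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Which plan includes SSO?", "expected": "Enterprise"},
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting does not affect scoring."""
    return " ".join(text.lower().split())


def score(expected: str, actual: str) -> bool:
    """Crude scorer (assumption): pass if the expected answer appears in the output."""
    return normalize(expected) in normalize(actual)


def run_eval(answer_fn: Callable[[str], str],
             eval_set: List[Dict[str, str]],
             label: str) -> Dict[str, object]:
    """Run every case through the system and summarize the result."""
    total = len(eval_set)
    passed = sum(score(c["expected"], answer_fn(c["question"])) for c in eval_set)
    return {
        "label": label,            # e.g. a commit hash or experiment name
        "timestamp": time.time(),
        "total": total,
        "passed": passed,
        "accuracy": passed / total if total else 0.0,
    }


def append_history(record: Dict[str, object], path: str) -> None:
    """Append one run as a JSON line so quality can be charted over time."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def fake_system(question: str) -> str:
    """Stand-in for the real RAG pipeline (hypothetical)."""
    if "refund" in question:
        return "Refunds are accepted within 30 days."
    return "The Pro plan."


result = run_eval(fake_system, EVAL_SET, label="baseline")
```

Calling `append_history(result, "eval_history.jsonl")` after each change (the file name is arbitrary) leaves a log you can plot next to latency and cost.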
Teams that invest in evaluation iterate with confidence. Teams that do not are shipping vibes.