How to evaluate a generative AI system in practice

A generative AI demo impresses in minutes. You write a prompt, the answer looks good, and the room nods. The trouble starts when that same system begins answering thousands of real questions, from real users, about cases nobody tested. That is when the question shifts from "does it look good?" to "how do I know it is good, consistently and measurably?"

Evaluating a generative AI system is different from evaluating traditional software. There is no single "correct" output: the same question allows several acceptable answers and many wrong ones that sound convincing. On top of that, behaviour changes when you alter the prompt, the knowledge base or the model version. Without an evaluation process, every change is a leap in the dark.

This article offers a practical way to evaluate these systems, from first prototype to production, without drowning in academic metrics or naively trusting the first demo. The core idea is simple: define what success means, measure it repeatably, and look at quality, cost and risk together.

What it means to evaluate a generative AI system

Evaluating is not giving a subjective grade to a single answer. It is measuring, repeatably, whether the system meets the goal it was built for, across many representative cases. A customer support assistant, a summary generator and a code copilot have distinct goals and therefore distinct evaluation criteria.

It helps to separate three layers that are often conflated. The first is the model itself, for example an LLM. The second is the system around it: prompts, context retrieval through RAG, rules, filters and tools. The third is the end user's experience. An answer can be technically correct and still fail because it is too long, arrives too late or does not cite its source. Evaluating well means looking at all three.

Define the task and success criteria before the metrics

The most common mistake is choosing metrics before defining what you want. The right order is the reverse: first describe the task precisely, then what counts as a good answer, and only at the end choose how to measure.

It is worth writing down, in plain language, three things: what the system must do, what it must never do, and what a good enough answer looks like. In an internal human resources assistant, for example, a good answer cites the correct policy, does not invent figures and routes to a person when the case is sensitive. These criteria then become the basis of the tests.
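One way to turn written criteria like these into repeatable tests is a small rule-based checker. The sketch below is only an illustration for the human resources example: the function name, the "contact hr" routing phrase and the idea of passing in the set of figures that appear in the policy are all assumptions, not part of any particular system.

```python
import re


def check_answer(answer: str, policy_name: str, allowed_figures: set[str], sensitive: bool) -> dict[str, bool]:
    """Apply three plain-language criteria to one answer (hypothetical rules)."""
    # Every number mentioned in the answer, as text.
    figures = set(re.findall(r"\d+(?:\.\d+)?", answer))
    return {
        # Criterion 1: the answer names the policy it relies on.
        "cites_policy": policy_name in answer,
        # Criterion 2: every figure in the answer also appears in the policy.
        "no_invented_figures": figures <= allowed_figures,
        # Criterion 3: sensitive cases are routed to a person (assumed phrase).
        "routes_when_sensitive": (not sensitive) or ("contact hr" in answer.lower()),
    }


good = check_answer(
    "Under the Leave Policy you have 22 days of paid leave.",
    "Leave Policy", {"22"}, sensitive=False,
)
bad = check_answer("You are entitled to 30 days.", "Leave Policy", {"22"}, sensitive=True)
```

Checks this simple will miss paraphrases and nuance, so they complement human review rather than replace it.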

Build a representative evaluation set

You do not evaluate a system with three hand-picked questions. You need a set of cases — sometimes called a golden set — that represents real use: frequent questions, hard cases, ambiguous requests and situations the system should refuse. Fifty to two hundred well-chosen cases are worth more than thousands generated at random.

A few principles help to build this set:

Coverage: include the most common topics and intents, but also the rare and risky ones.
Negative cases: out-of-scope questions, to check whether the system refuses or routes instead of inventing.
Reference answers: whenever possible, an ideal answer or the facts the answer must contain.
Updating: revise the set when new question types appear in production.
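
The principles above could be captured in a small data structure. This is a minimal sketch, assuming a Python codebase; the class name, field names, category labels and example questions are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One entry in the evaluation set (all field names are illustrative)."""
    question: str
    category: str                     # e.g. "frequent", "hard", "ambiguous", "out_of_scope"
    should_refuse: bool = False       # negative case: the system should refuse or route
    required_facts: list[str] = field(default_factory=list)  # facts a good answer must contain


def coverage_report(cases: list[EvalCase]) -> dict[str, int]:
    """Count cases per category so that gaps in coverage become visible."""
    return dict(Counter(case.category for case in cases))


def missing_references(cases: list[EvalCase]) -> list[str]:
    """Questions that should be answered but have no reference facts yet."""
    return [c.question for c in cases if not c.should_refuse and not c.required_facts]


golden_set = [
    EvalCase("How many vacation days do I get?", "frequent", required_facts=["vacation policy"]),
    EvalCase("Can I carry over unused leave after a transfer?", "hard"),
    EvalCase("What is the weather tomorrow?", "out_of_scope", should_refuse=True),
]
```

Running coverage_report and missing_references whenever the set is revised gives a quick view of which categories are thin and which cases still lack a reference answer.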

Metrics that make sense: quality, faithfulness and safety

There is no single metric. It is best to combine a few, depending on the task. For tasks with a well-defined answer, such as classification, extraction or factual questions, you can measure accuracy, precision and recall. For free text, automatic word-overlap metrics such as BLEU or ROUGE say little about real quality; you need to assess faithfulness to context, relevance and clarity.
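
For the well-defined case, the three metrics can be computed directly from expected and predicted labels. The sketch below assumes a binary task (for instance, "should this question be routed to a person?"); the function name and the sample labels are illustrative.

```python
def classification_metrics(expected: list[bool], predicted: list[bool]) -> dict[str, float]:
    """Accuracy, precision and recall for a binary task."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)          # true positives
    fp = sum(1 for e, p in pairs if not e and p)      # false positives
    fn = sum(1 for e, p in pairs if e and not p)      # false negatives
    tn = sum(1 for e, p in pairs if not e and not p)  # true negatives
    total = len(pairs)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative labels: True means "should be routed to a person".
expected = [True, True, True, False, False]
predicted = [True, True, False, True, False]
metrics = classification_metrics(expected, predicted)
```

With these sample labels there are two true positives, one false positive and one false negative, so accuracy is 3/5 and both precision and recall are 2/3.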

Three dimensions tend to be decisive: quality (is the answer correct, complete and useful?), faithfulness (does the answer rely on the sources provided or does it invent?) and safety (does the answer avoid dangerous content, data leaks and an inappropriate tone?). Measuring only the first and ignoring the other two is like validating a car by its speed without looking at the brakes.

Automatic, human and model-as-judge evaluation

There are three ways to score answers, and good judgement lies in combining them. Human evaluation is the most reliable for nuance, but it is slow and expensive. Automatic evaluation by rules works when there is a verifiable answer: a number, a code, a fact. The third approach uses one model to evaluate another, commonly called LLM as a judge.

Using an LLM as a judge is appealing because it is fast and cheap, but it requires care. The judge may share the same biases as the model under test, favour long answers or be sensitive to the order of the options. The healthy practice is to calibrate the judge against a human-scored set and use it for triage, not as the final verdict on critical decisions.
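
Calibration can start with something as simple as comparing the judge's verdicts with human verdicts on the same answers. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance; the "pass"/"fail" labels and the sample verdicts are invented for illustration.

```python
def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Share of cases where the judge gives the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement_rate(human, judge)
    labels = set(human) | set(judge)
    # Probability that both raters pick the same label purely by chance.
    by_chance = sum((human.count(l) / n) * (judge.count(l) / n) for l in labels)
    if by_chance == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - by_chance) / (1 - by_chance)


# Illustrative verdicts on eight answers.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
raw = agreement_rate(human, judge)
kappa = cohens_kappa(human, judge)
```

In this example the judge agrees with the human on 6 of 8 answers (0.75), but kappa is about 0.47, a reminder that raw agreement can flatter a judge when one label dominates.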

Hallucinations and faithfulness to the source

Hallucination — when the model states something false with confidence — is the risk that most undermines user trust. In systems with RAG, the key question is not just "is the answer correct?", but "is the answer supported by the retrieved documents?". An answer that is right by chance, without support in the sources, is a problem waiting to happen.

To measure this, you check whether each relevant claim in the answer is found in the provided context. This can be done with human review on samples and with automatic checks that compare the answer to the sources. When the rate of unsupported claims rises, the problem is usually in retrieval — wrong or insufficient documents — and not in the model.
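
As a first automatic check, each sentence of the answer can be compared with the retrieved context. The sketch below is a deliberately crude lexical proxy: it treats every sentence as one claim and flags it when too few of its words appear in the context. The 0.6 threshold, the function names and the sample texts are assumptions; a real check would need to handle paraphrase, which word overlap cannot.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_claims(answer: str, context: str, min_overlap: float = 0.6) -> list[str]:
    """Return sentences whose word overlap with the context is below min_overlap."""
    context_tokens = _tokens(context)
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    flagged = []
    for sentence in sentences:
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & context_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


# Invented example texts.
context = "Employees receive 22 days of paid leave per year. Unused days expire on 31 March."
answer = "Employees receive 22 days of paid leave per year. Managers get a bonus of 500 euros."
flagged = unsupported_claims(answer, context)
```

Here only the second sentence is flagged, because almost none of its words occur in the context. Flagged sentences are candidates for human review, not a verdict.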

Cost, latency and robustness: evaluating beyond quality

An excellent system that costs too much or takes too long never reaches production. Serious evaluation also measures the cost per answer (number of tokens and calls), the latency (time to first token and total time) and the robustness (what happens with badly written, very long or multilingual questions).
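
Cost and latency are easy to summarise once each call is logged. The sketch below assumes a hypothetical CallRecord with token counts and timings, and takes prices per 1,000 tokens as parameters; the prices in the example are placeholders, not real rates.

```python
import math
import statistics
from dataclasses import dataclass


@dataclass
class CallRecord:
    """Measurements logged for one answer (hypothetical fields)."""
    input_tokens: int
    output_tokens: int
    first_token_s: float  # time until the first token arrived
    total_s: float        # time until the answer was complete


def summarize(records: list[CallRecord], price_in_per_1k: float, price_out_per_1k: float) -> dict[str, float]:
    """Mean cost per answer, median time to first token, 95th percentile total time."""
    costs = [
        r.input_tokens / 1000 * price_in_per_1k + r.output_tokens / 1000 * price_out_per_1k
        for r in records
    ]
    totals = sorted(r.total_s for r in records)
    # Nearest-rank 95th percentile: slow outliers are what users remember.
    p95 = totals[max(0, math.ceil(0.95 * len(totals)) - 1)]
    return {
        "mean_cost": statistics.mean(costs),
        "median_first_token_s": statistics.median(r.first_token_s for r in records),
        "p95_total_s": p95,
    }


records = [
    CallRecord(1000, 200, 0.5, 2.0),
    CallRecord(2000, 400, 0.7, 4.0),
    CallRecord(1500, 300, 0.6, 3.0),
]
# Placeholder prices per 1,000 tokens, for illustration only.
summary = summarize(records, price_in_per_1k=0.001, price_out_per_1k=0.002)
```

Reporting a high percentile alongside the typical value keeps a few very slow answers from hiding behind a comfortable average.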

It is also worth testing stability: the same question asked several times should give consistent answers. Huge variation between runs is a sign that the system is fragile and hard to control. These measures are not technical details; they are what separates a prototype from a product.
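
A simple stability check asks the same question several times and measures how often the most common answer comes back. The sketch below assumes an `ask` callable standing in for the system under test, stubbed here with canned answers. Exact match after normalising case and whitespace is only suitable for short factual answers; free text would need a looser comparison.

```python
from collections import Counter
from typing import Callable


def consistency(ask: Callable[[str], str], question: str, runs: int = 5) -> float:
    """Fraction of runs that return the most common (normalised) answer."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Normalise case and whitespace so trivial differences do not count.
    answers = [" ".join(ask(question).lower().split()) for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


# Stub standing in for the real system: returns canned answers in order.
canned = iter(["22 days", "22 days", "22  Days", "25 days", "22 days"])
score = consistency(lambda question: next(canned), "How many days of leave?", runs=5)
```

With the canned answers above, four of the five runs agree after normalisation, so the score is 0.8.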

Common mistakes when evaluating generative AI

A few patterns repeat across many teams. The first is evaluating only with easy examples and concluding that it always works. The second is trusting an automatic text metric as if it were absolute truth. The third is testing once, approving, and never measuring again — when a single model-version change can turn everything upside down.

There is also the mistake of mixing the evaluation set with the examples used to tune prompts: if the system studied for the test, the results deceive. And, finally, the mistake of looking only at the average and ignoring the worst cases — it is often one bad answer, at a sensitive moment, that destroys a customer's trust.
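
Both of these mistakes can be guarded against with a few lines of code. The sketch below is illustrative: one helper flags evaluation questions that also appear among the examples used to tune prompts, and the other reports the worst-scoring cases next to the mean. The function names, the score scale (0 to 1) and the sample data are assumptions.

```python
def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings match."""
    return " ".join(text.lower().split())


def leaked_questions(eval_questions: list[str], tuning_examples: list[str]) -> list[str]:
    """Evaluation questions that also appear in the prompt-tuning examples."""
    tuning = {_normalize(t) for t in tuning_examples}
    return [q for q in eval_questions if _normalize(q) in tuning]


def mean_and_worst(scores: dict[str, float], k: int = 3) -> tuple[float, list[tuple[str, float]]]:
    """Mean score plus the k lowest-scoring cases, worst first."""
    mean = sum(scores.values()) / len(scores)
    worst = sorted(scores.items(), key=lambda item: item[1])[:k]
    return mean, worst


# Invented sample data.
leaks = leaked_questions(
    ["How many vacation days do I get?", "Who approves expenses?"],
    ["how many  vacation days do I get?"],
)
scores = {"q1": 0.9, "q2": 1.0, "q3": 0.1, "q4": 0.95, "q5": 0.85}
mean, worst = mean_and_worst(scores, k=1)
```

In this sample the mean is 0.76, which looks acceptable, while the worst case scores 0.1; only the second number reveals the answer that would damage trust.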

Mini case: an internal assistant at a services company

A financial services company built an assistant to answer employees' questions about internal policies, with RAG over its documents. In the demo, everything looked perfect. Before opening it to the whole company, the team prepared a set of 120 real questions, with reference answers approved by the compliance area.

The first evaluation was revealing: 82% of the answers were correct, but 15% contained claims not supported by the documents, and the average response time was nine seconds. On investigation, the team realised the problem was in retrieval, which was returning outdated documents. It improved the indexing and added an instruction for the system to refuse when it found no basis. In a second round, unsupported claims fell to 4% and latency to four seconds. Only then did it go ahead — and it kept the evaluation running weekly to catch regressions.
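
A weekly run like this one can end in an automatic release gate. The sketch below uses the two metrics reported for both rounds of the case; the thresholds (5% unsupported claims, five seconds) are assumptions chosen for illustration, since the case does not state the limits the team used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoundResult:
    """Metrics from one evaluation round."""
    unsupported_rate: float  # share of answers with unsupported claims
    avg_latency_s: float     # average response time in seconds


def release_blockers(result: RoundResult, max_unsupported_rate: float, max_latency_s: float) -> list[str]:
    """Names of the metrics that exceed their limit; an empty list means go ahead."""
    blockers = []
    if result.unsupported_rate > max_unsupported_rate:
        blockers.append("unsupported claims")
    if result.avg_latency_s > max_latency_s:
        blockers.append("latency")
    return blockers


# Figures from the case above; the limits are hypothetical.
first_round = RoundResult(unsupported_rate=0.15, avg_latency_s=9.0)
second_round = RoundResult(unsupported_rate=0.04, avg_latency_s=4.0)
LIMITS = {"max_unsupported_rate": 0.05, "max_latency_s": 5.0}

first_blockers = release_blockers(first_round, **LIMITS)
second_blockers = release_blockers(second_round, **LIMITS)
```

Under these assumed limits the first round is blocked on both metrics and the second round passes, which mirrors the decision the team made.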

In practice

Evaluating a generative AI system is not a one-off exam; it is a habit. Start by defining what success means, build a representative set of cases, measure quality, faithfulness and safety together, and do not forget cost and latency. Use human evaluation where it matters and automate the rest so you can repeat the measurement at every change.

The reward is twofold: fewer unpleasant surprises in production and an objective basis for deciding. In generative AI, the difference between an impressive toy and a trustworthy product is not in the model — it is in the discipline with which we evaluate it.