How to evaluate AI answers: making sure your assistant is not making things up

You have put an AI assistant in front of your customers or your team. In the first tests it seemed brilliant: fluent, fast, convincing answers. And that is precisely where the danger lies. A language model is extraordinarily good at sounding convincing, even when it is wrong. Fluency is not a sign of correctness. The question that separates a serious AI project from a time bomb is simple to ask and hard to answer: how do you know your assistant is getting it right?

Most companies launch an AI assistant and measure success by gut feeling: "it seems to work well". But gut feeling misleads, especially when the answers are rarely checked against the truth. Evaluating AI answers in a disciplined way is not a luxury for those with time to spare. It is the difference between a system you can trust and one that, sooner or later, will give a wrong answer with full confidence at the worst possible moment.

Why fluency misleads

A language model generates plausible text by predicting, piece by piece, what is most likely to come next. By default it does not consult a fact base or verify what it says; it simply produces the most likely continuation. When that continuation coincides with the truth, we have a correct answer; when it does not, we have a "hallucination" that looks as confident as any other answer. The problem is that, from the outside, the two are indistinguishable: the confident tone is the same.

That is why human intuition fails at evaluating AI. We are used to associating fluency and confidence with competence — those who answer well and quickly usually know what they are talking about. With a language model, that association breaks: it always answers well and quickly, whether it knows or not. Trusting the feeling is trusting exactly the signal the model produces even when it is wrong.

The first step: a set of questions with known answers

You cannot evaluate what you do not measure, and you cannot measure without a reference. The starting point of any serious evaluation is building a set of representative questions, the ones real users ask, for which you already know the right answer. It is your "exam" for the assistant. Without this set, any judgment about quality is only an impression; with it, you have an objective score you can track over time.

Building this set forces a valuable exercise in itself: defining what a "right" answer is for each type of question. Often you find that even humans do not agree — and that discovery is gold, because it reveals ambiguities that would have to be resolved anyway. The evaluation set is not just a testing tool; it is a way of clarifying what you expect from the system.
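
As a rough illustration, the "exam" can be sketched in a few lines of Python. Everything here is an assumption made for the example: the EvalCase structure, the run_exam function, the sample questions and the fake_assistant stand-in are all hypothetical, and checking whether the reference answer appears in the reply is a deliberately crude scoring rule. Real answers often need a human or a more careful comparison.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected: str  # the verified reference answer


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def run_exam(cases: list[EvalCase], ask: Callable[[str], str]) -> float:
    """Return the fraction of cases whose reference answer appears in the reply."""
    if not cases:
        raise ValueError("the evaluation set is empty")
    correct = sum(
        1
        for case in cases
        if normalize(case.expected) in normalize(ask(case.question))
    )
    return correct / len(cases)


# Hypothetical stand-in for the real assistant, only here to make the sketch runnable.
def fake_assistant(question: str) -> str:
    canned = {
        "How many vacation days do new employees get?": "New employees get 22 days per year.",
        "Who approves expense reports?": "Expense reports are approved by HR.",
    }
    return canned.get(question, "I do not know.")


# Made-up example questions with their (assumed) verified answers.
EXAM = [
    EvalCase("How many vacation days do new employees get?", "22 days"),
    EvalCase("Who approves expense reports?", "the direct manager"),
]

score = run_exam(EXAM, fake_assistant)
print(f"accuracy: {score:.0%}")
```

In this toy run the assistant gets one of the two questions right, and the second answer is wrong while sounding just as confident as the first. The point of the sketch is the shape: a fixed list of questions, a fixed list of references, and a number you can compare from one run to the next.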

The dimensions worth evaluating

Correctness: is the answer factually right? The most obvious and most critical dimension.
Grounding: is the answer based on the right documents, or did the model make it up? An answer right by luck is fragile.
Completeness: did it answer what was asked, or only part?
Safety: did it refuse to answer what it should not, and admit when it did not know instead of inventing?
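
One possible way to keep these four dimensions separate, rather than collapsing them into a single score, is to record one verdict per dimension for each answer and aggregate afterwards. The sketch below assumes the verdicts come from a human reviewer; the Judgment structure, the dimension_rates function and the sample verdicts are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class Judgment:
    """One reviewer verdict per dimension for a single answer."""
    correct: bool   # factually right
    grounded: bool  # supported by the right documents
    complete: bool  # answered everything that was asked
    safe: bool      # refused or admitted ignorance where it should


def dimension_rates(judgments: list[Judgment]) -> dict[str, float]:
    """Return the share of answers that pass each dimension."""
    if not judgments:
        raise ValueError("no judgments to aggregate")
    return {
        f.name: sum(getattr(j, f.name) for j in judgments) / len(judgments)
        for f in fields(Judgment)
    }


# Made-up verdicts for four answers.
judgments = [
    Judgment(correct=True, grounded=True, complete=True, safe=True),
    Judgment(correct=True, grounded=False, complete=True, safe=True),  # right, but by luck
    Judgment(correct=False, grounded=False, complete=False, safe=True),
    Judgment(correct=True, grounded=True, complete=False, safe=True),
]

rates = dimension_rates(judgments)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

Keeping the dimensions apart makes the second answer visible: it counts as correct but not grounded, which is exactly the fragile "right by luck" case a single accuracy number would hide.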

The most valuable signal: knowing how to say "I do not know"

An assistant that always answers is more dangerous than one that sometimes admits ignorance. The ability to recognize limits — "I do not have information on that" instead of inventing a plausible answer — is one of the most important indicators of a trustworthy system. When evaluating, it is worth deliberately including questions the assistant should not have an answer to, and checking whether it has the humility to say so instead of filling the gap with convincing fiction.

This property does not happen by accident: it is designed. Instructing the model to admit uncertainty, giving it access only to the relevant information (with techniques like retrieval-augmented generation, or RAG) and penalizing inventions in the evaluation all push the system toward honesty. An assistant that says "I do not know" in the 5% of cases where it does not know is far more useful than one that invents in those 5% and undermines trust in the other 95%.
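
A minimal sketch of how the "I do not know" check might look, assuming you keep a separate list of questions the assistant should not be able to answer. The marker phrases, the abstention_report function and the sample replies are hypothetical. Matching on fixed phrases is a naive heuristic that will miss differently worded refusals, so in practice a human review of the flagged replies is still advisable.

```python
# Phrases treated as an admission of ignorance (an assumption; adapt to your assistant).
ABSTENTION_MARKERS = (
    "i do not know",
    "i don't know",
    "i do not have information",
)


def is_abstention(reply: str) -> bool:
    """Naive check: does the reply contain one of the known refusal phrases?"""
    text = reply.lower()
    return any(marker in text for marker in ABSTENTION_MARKERS)


def abstention_report(replies: dict[str, str]) -> dict[str, float]:
    """Summarize replies to questions that are unanswerable on purpose."""
    if not replies:
        raise ValueError("no replies to evaluate")
    abstained = sum(is_abstention(reply) for reply in replies.values())
    return {
        "abstained": abstained,
        "invented": len(replies) - abstained,
        "abstention_rate": abstained / len(replies),
    }


# Made-up replies to questions the assistant has no documents for.
unanswerable_replies = {
    "What is the CEO's home address?": "I do not have information on that.",
    "What will the 2031 travel policy say?": "The 2031 policy allows business class on all flights.",
    "What is the office Wi-Fi password on Mars?": "I don't know.",
}

report = abstention_report(unanswerable_replies)
print(report)
```

Here two of the three replies admit ignorance and one fills the gap with a confident invention. That single invented reply is the one worth reading closely, because it is the kind of answer users will not be able to tell apart from a correct one.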

Continuous evaluation, not a one-off test

The classic mistake is evaluating once, at launch, and never again. But the world changes: the documents feeding the assistant are updated, users' questions evolve, and the model itself can be updated by the vendor. Answer quality that was good three months ago may have degraded without anyone noticing. Evaluation has to be a continuous habit: run the "exam" regularly and track the score, so you catch regressions before users catch them for you.
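
Tracking the score over time can be as simple as keeping a dated history of exam results and flagging any run that drops noticeably compared with the previous one. The following is a sketch under stated assumptions: the find_regressions function, the 0.02 tolerance and the dates and scores in the history are all invented for illustration.

```python
from datetime import date


def find_regressions(
    history: list[tuple[date, float]],
    tolerance: float = 0.02,
) -> list[tuple[date, float, float]]:
    """Return (run date, previous score, current score) for every run whose
    score fell by more than `tolerance` compared with the run before it."""
    ordered = sorted(history)  # oldest run first
    regressions = []
    for (_, previous), (when, current) in zip(ordered, ordered[1:]):
        if previous - current > tolerance:
            regressions.append((when, previous, current))
    return regressions


# Made-up monthly exam scores.
history = [
    (date(2024, 1, 1), 0.92),
    (date(2024, 2, 1), 0.91),
    (date(2024, 3, 1), 0.84),
]

regressions = find_regressions(history)
for when, previous, current in regressions:
    print(f"{when}: score dropped from {previous:.0%} to {current:.0%}")
```

The small dip in February stays within the tolerance and is ignored, while the March drop is flagged. The tolerance is a judgment call: with a small evaluation set, one or two questions flipping can move the score by several points, so it should not be set tighter than the noise of your own exam.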

A concrete case

A company launched an internal assistant that answered employees' questions about procedures. In the first weeks, informal feedback was great — "it answers everything, it is amazing". But when they finally built a set of fifty questions with verified answers and ran the evaluation, the reality was different: the assistant got about 70% right, but in the remaining 30% it gave wrong answers with the same confidence as the right ones. Worse: several of the wrong answers were about sensitive policies, where an error had real consequences. The feeling of "it answers everything" had hidden a serious problem. With the evaluation set in hand, they identified that most errors came from outdated documents that still circulated. They cleaned the source, adjusted the instructions for the model to admit uncertainty, and the accuracy rose above 90% — with the remaining errors being, mostly, honest "I do not know" instead of inventions. The evaluation not only measured the problem but pointed to the solution.

In practice

Before trusting an AI assistant with something that matters, give it a real exam: gather real questions with known answers, run them, and measure how many it gets right — and, above all, what it does when it does not know. Fluency will keep impressing; evaluation is what tells you whether that fluency rests on truth. Is your AI assistant being evaluated against reality, or are you trusting the feeling that "it seems to work"?