Data quality: validate before the error spreads

There is a moment that undermines the confidence of an entire organization: when someone points at a number on a dashboard and says "this is wrong", and is right. From then on, every report is under suspicion. And the most frustrating part is that, almost always, the error was not born in the report. It entered much earlier, in a pipeline nobody was watching. That is why data quality cannot be an afterthought: it has to be checked automatically, on every load.

The rule is simple to state and transformative in practice: the error should be caught by the system, not discovered by the user.

Why a fast pipeline is not enough

It is tempting to optimize pipelines only for speed and volume. But there is no point in delivering data fast if that data is wrong. In fact it is worse, because a fast error spreads fast. A missing value, a duplicate key or an absurd date travels through the whole system and ends up in a board report, where someone makes a decision based on it. Speed without quality is not efficiency: it is accelerated risk.

The checks that pay off

You do not need a complex system to get most of the reliability benefit. A set of simple checks, run on every load, catches the vast majority of errors:

Completeness: required columns cannot arrive empty. A customer without an identifier or a sale without a value is a red flag.
Uniqueness: keys cannot have duplicates, or totals inflate silently.
Ranges: values have to make sense. An age of 300 years or a negative sale should be stopped.
Referential integrity: a sale has to point to a customer and a product that actually exist.
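
The four checks above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the function names, the column names (sale_id, customer_id, amount) and the sample rows are all hypothetical, and a real pipeline would more likely run equivalent checks in SQL or in a validation library.

```python
from collections import Counter

def check_completeness(rows, required):
    """Return indexes of rows where a required column is missing or empty."""
    return [i for i, row in enumerate(rows)
            if any(row.get(col) in (None, "") for col in required)]

def check_uniqueness(rows, key):
    """Return the key values that appear more than once."""
    counts = Counter(row.get(key) for row in rows)
    return [value for value, n in counts.items() if n > 1]

def check_range(rows, column, low, high):
    """Return indexes of rows whose value falls outside [low, high]."""
    return [i for i, row in enumerate(rows)
            if row.get(column) is not None and not (low <= row[column] <= high)]

def check_references(rows, column, known_ids):
    """Return indexes of rows that point to an id that does not exist."""
    return [i for i, row in enumerate(rows)
            if row.get(column) is not None and row[column] not in known_ids]

# Hypothetical sample data: one load of sales plus the set of known customers.
customers = {"C1", "C2"}
sales = [
    {"sale_id": 1, "customer_id": "C1", "amount": 120.0},
    {"sale_id": 2, "customer_id": "C9", "amount": 35.5},   # unknown customer
    {"sale_id": 2, "customer_id": "C2", "amount": -10.0},  # duplicate key, negative sale
    {"sale_id": 4, "customer_id": None, "amount": 80.0},   # missing identifier
]

report = {
    "completeness": check_completeness(sales, ["sale_id", "customer_id", "amount"]),
    "uniqueness": check_uniqueness(sales, "sale_id"),
    "ranges": check_range(sales, "amount", 0, 1_000_000),
    "referential_integrity": check_references(sales, "customer_id", customers),
}
print(report)
```

Each function returns the offending rows or keys rather than a bare true or false, so an alert can say exactly what went wrong. The range limits (0 to 1,000,000) are arbitrary placeholders; sensible bounds depend on the business.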

The golden rule: stop, do not publish

The most important decision is not which checks to run, but what to do when one fails. The right answer is clear: when a critical check fails, the pipeline should stop and alert, never silently publish suspicious data. It is counterintuitive (nobody likes a pipeline that fails), but a pipeline that stops in time protects trust; one that publishes garbage destroys it. Failing loudly and early is always better than corrupting slowly and in secret.
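
One way to picture the "stop, do not publish" rule is a load step that runs its critical checks first and raises an error instead of publishing when any of them fails. The sketch below is a hedged illustration under assumed names: run_load, DataQualityError, and the publish and alert callbacks are hypothetical and stand in for whatever your orchestrator and alerting channel provide.

```python
class DataQualityError(Exception):
    """Raised when a critical check fails, so the load stops before publishing."""

def run_load(rows, checks, publish, alert):
    """Run every critical check; publish only if all of them pass."""
    failures = []
    for name, check in checks:
        problems = check(rows)  # each check returns a list of offending rows
        if problems:
            failures.append((name, problems))
    if failures:
        alert(failures)  # tell someone, loudly
        raise DataQualityError(
            f"{len(failures)} critical check(s) failed; nothing was published"
        )
    publish(rows)  # reached only when every check passed

# Hypothetical check: sale amounts must not be negative.
def no_negative_amounts(rows):
    return [i for i, row in enumerate(rows) if row["amount"] < 0]

checks = [("ranges: amount >= 0", no_negative_amounts)]
published, alerts = [], []  # stand-ins for the real target table and alert channel

try:
    run_load([{"amount": 50.0}, {"amount": -10.0}],
             checks, published.extend, alerts.append)
except DataQualityError as exc:
    stopped = str(exc)
else:
    stopped = None
print(stopped, published, alerts)
```

The point of raising an exception is that the failure cannot be ignored by accident: the bad load never reaches the published data, and the scheduler sees a failed run rather than a quiet success.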

In practice: the Excel that reveals the problem

Imagine a finance team that, every month, exported the report to Excel and "fixed some numbers by hand" before presenting it. That parallel Excel was, in fact, a symptom: the pipeline delivered data nobody trusted. Once automatic validations (completeness, uniqueness, ranges) were integrated at the source, errors started being caught on load, with an alert, instead of being manually corrected at the end. The parallel Excel disappeared, not by order, but because it was no longer needed.

Quality as part of the pipeline, not as a patch

The big mindset shift is to stop seeing quality as a separate task and start seeing it as an integral part of the pipeline. Each load carries its own guarantees. When validation is built in, trust stops depending on luck or human vigilance, and becomes a property of the system. And in your organization: are data errors caught by the pipeline, or discovered by a manager looking at a number that does not add up?