Data testing: how to ensure what enters the warehouse is correct

In software development, no one in their right mind ships code to production without testing it. Teams write automated tests that check, on every change, whether everything still works; it is a basic discipline that everyone takes for granted. And yet, in the world of data, an equivalent practice is still rare: every day, data is loaded into warehouses and used for important decisions without anyone systematically checking whether it is even correct. It is assumed to be fine, until the day a report shows an absurd number and someone asks, too late, "is this right?". Data tests are the answer to this blind spot: the discipline of automatically verifying that what enters the warehouse is reliable, before it contaminates everything that follows.

The absence of this practice is one of the biggest sources of distrust in a company's data. A single error that goes unnoticed — a duplicated value, a missing field, a number in the wrong unit — propagates through reports, feeds decisions and undermines trust in the entire data platform. Once someone finds one bad number, they start doubting all of them. Data tests exist precisely to catch these problems at the source, before they show up in the worst possible way.

This article is not about a specific tool. It is about a mindset shift: treating data with the same verification discipline we already apply to software, because the consequences of wrong data are at least as serious as those of buggy code.

Why data needs tests as much as code

There is a dangerous difference between software and data that explains why data tests are so neglected. When code has a bug, it often breaks visibly: the program fails, throws an error, stops. When data is wrong, the system keeps running happily: the pipeline runs, the report appears, the charts render. Only the numbers are wrong. An error in data is silent, and it is that silence that makes it so dangerous. It goes unnoticed for a long time, contaminating decisions, until the accumulated damage forces you to look back.

This characteristic changes everything. With code, the absence of a visible error is a reasonable sign it is working. With data, the absence of a visible error says nothing — everything may be fine or everything may be wrong, and there is no way to know without actively checking. That is why trusting that "the data is fine because the pipeline did not fail" is a dangerous illusion. The pipeline can run perfectly and load completely wrong data.

What a data test verifies

A data test is an automatic check of a rule the data must follow. Before the data is accepted into the warehouse or used downstream, you test whether it meets expectations. If a rule fails, the system raises an alert and, ideally, blocks the load instead of letting suspicious data through. It is the equivalent of quality control at the entrance of a factory: nothing moves to the production line without passing inspection.

These rules translate, into concrete checks, what we know must be true about the data. Some are universal — a unique key should have no duplicates, a mandatory field should not be empty. Others come from business knowledge — a sale cannot have a negative value, an age cannot be 200, today's total cannot be ten times yesterday's without a reason. Every rule you encode into a test is a net that catches a class of errors before they cause damage.
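As a rough illustration of what encoding a rule into a test can look like, here is a minimal Python sketch. The field names (amount, customer_age), the 120-year threshold, and the Rule and failed_rules helpers are all assumptions invented for this example; they do not come from any particular tool.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict  # one record, e.g. {"sale_id": 1, "amount": 19.9, "customer_age": 34}


@dataclass(frozen=True)
class Rule:
    name: str
    check: Callable[[Row], bool]  # returns True when the row respects the rule


# Hypothetical rules: field names and thresholds are illustrative only.
RULES = [
    # Universal rule: a mandatory field must be filled.
    Rule("amount is present", lambda r: r.get("amount") is not None),
    # Business rules: missing values are left to the rule above.
    Rule("amount is not negative",
         lambda r: r.get("amount") is None or r["amount"] >= 0),
    Rule("age is plausible",
         lambda r: r.get("customer_age") is None or 0 <= r["customer_age"] <= 120),
]


def failed_rules(rows: list[Row], rules: list[Rule] = RULES) -> dict[str, int]:
    """Count, per rule, how many rows break it. An empty dict means everything passed."""
    failures: dict[str, int] = {}
    for rule in rules:
        broken = sum(1 for row in rows if not rule.check(row))
        if broken:
            failures[rule.name] = broken
    return failures


rows = [
    {"sale_id": 1, "amount": 19.9, "customer_age": 34},
    {"sale_id": 2, "amount": -5.0, "customer_age": 41},
    {"sale_id": 3, "amount": None, "customer_age": 200},
]
print(failed_rules(rows))
```

Keeping each rule as a named, independent predicate means a failure report can say which expectation was broken and on how many rows, which is what makes the alert actionable.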

The types of test that are worth the most

Uniqueness: checking that keys have no duplicates — the number one cause of inflated totals nobody notices.
Not-null: confirming that mandatory fields are filled, so as not to lose records or distort averages silently.
Ranges and valid values: ensuring numbers fall within what makes sense — no negative prices, no percentages above a hundred, no dates in the future where they should not be.
Consistency and volume: catching suspicious changes — a table that suddenly has half the rows, a total that jumps inexplicably — that signal a problem at the source.
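The four types above can be sketched as small table-level checks. This is a minimal sketch in plain Python over a list of dictionaries; the function names, the order_id and discount_pct fields, and the 30% volume tolerance are assumptions chosen for illustration, and a real setup would more likely run equivalent checks inside the warehouse or a testing tool.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Uniqueness: return the key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def null_count(rows, field):
    """Not-null: count rows where a mandatory field is missing or None."""
    return sum(1 for row in rows if row.get(field) is None)


def out_of_range(rows, field, low, high):
    """Ranges: return rows whose value falls outside [low, high].

    Missing values are skipped here; the not-null check covers them.
    """
    return [r for r in rows
            if r.get(field) is not None and not (low <= r[field] <= high)]


def volume_changed(current_rows, previous_rows, tolerance=0.3):
    """Volume: True when the row count moved by more than `tolerance`
    (as a fraction of the previous load's row count)."""
    if previous_rows == 0:
        return current_rows > 0
    return abs(current_rows - previous_rows) / previous_rows > tolerance


# Hypothetical batch with one duplicate key, one missing value, one bad percentage.
batch = [
    {"order_id": 1, "discount_pct": 10},
    {"order_id": 2, "discount_pct": 130},
    {"order_id": 2, "discount_pct": None},
]

print(duplicate_keys(batch, "order_id"))
print(null_count(batch, "discount_pct"))
print(out_of_range(batch, "discount_pct", 0, 100))
print(volume_changed(current_rows=50, previous_rows=100))
```

Each function returns the offending values or a count rather than a bare pass/fail, so whoever receives the alert can see at once what went wrong.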

The power of failing early and loudly

The great virtue of data tests is not only that they catch errors — it is that they catch them at the right time and place. An error detected at the entrance, before being loaded, is cheap to fix: you know exactly what failed and why, and nothing downstream has yet been contaminated. The same error discovered weeks later, in a board report, is expensive and embarrassing: it has already influenced decisions, already undermined trust, and tracing its origin amid everything that came after is a nightmare. The rule is clear: the earlier you catch a data problem, the cheaper it is to fix — and tests are what let you catch it as early as possible.

There is an additional value in failing "loudly". When a test fails and blocks the load, the problem becomes impossible to ignore — someone has to solve it before the data moves on. This is infinitely better than the silent alternative, where the wrong data passes and the problem is only discovered when it causes damage. A good testing system turns silent, expensive errors into loud, cheap alerts.
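One way to make a failure loud is to put the checks in front of the load and raise an exception when any of them fails, so the batch never reaches the warehouse. The sketch below assumes a hypothetical load_if_valid gate, two illustrative checks, and a plain Python list standing in for the warehouse table.

```python
class DataTestFailure(Exception):
    """Raised when a batch breaks a rule and must not be loaded."""


def load_if_valid(rows, checks, load):
    """Run every check first; call `load` only if all of them pass."""
    problems = []
    for check in checks:
        message = check(rows)  # None means the check passed
        if message is not None:
            problems.append(message)
    if problems:
        # Fail loudly: nothing is loaded, and the error names every broken rule.
        raise DataTestFailure("; ".join(problems))
    load(rows)
    return len(rows)


def no_duplicate_ids(rows):
    ids = [r["id"] for r in rows]
    return None if len(ids) == len(set(ids)) else "duplicate ids in batch"


def no_negative_amounts(rows):
    return None if all(r["amount"] >= 0 for r in rows) else "negative amount in batch"


warehouse = []  # stand-in for the real warehouse table
checks = [no_duplicate_ids, no_negative_amounts]

# A clean batch goes through.
load_if_valid([{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.0}],
              checks, warehouse.extend)

# A batch with a duplicated id is blocked before anything is written.
try:
    load_if_valid([{"id": 3, "amount": 7.0}, {"id": 3, "amount": 7.0}],
                  checks, warehouse.extend)
except DataTestFailure as exc:
    print(f"load blocked: {exc}")
```

Because the gate runs all checks before calling load, a rejected batch leaves the warehouse untouched, and the person on call sees every broken rule in a single message.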

The mistake of testing everything (or nothing)

There are two ways to fail with data tests. The first is having none and trusting luck, which is the starting point of most companies. The second, less obvious, is overkill: trying to test everything, creating hundreds of rules nobody maintains, and generating so many alerts that people start ignoring them, including the important ones. A testing system that cries wolf over harmless variations is as useless as one that never raises an alarm. In both cases, the alerts stop being taken seriously.

The balance is in testing what matters: the rules whose breach would cause real damage, on the most critical data. Start with a few high-value tests, such as the keys of the most used tables and the fields that feed the most important reports, and grow from there, with judgment. A few well-chosen tests that rarely fail, but mean something when they do, are worth more than hundreds nobody can keep up with.

A concrete case

A company discovered, the hard way, the cost of not testing data. For weeks, a sales report used by the board showed slightly inflated numbers without anyone suspecting: the values seemed plausible, just "a good period". The cause was a problem at the source that, at some point, started loading some transactions in duplicate. Since nothing checked the uniqueness of the keys, the duplicates passed silently into the warehouse and inflated the totals. The problem was only discovered when someone, by chance, cross-checked the report against another source and noticed the discrepancy. By then, several decisions had already been made based on wrong numbers.

After this painful episode, the team introduced data tests at the critical points: a uniqueness test on the keys of the main fact tables, range tests on the sales values, and a volume check that flagged abnormal jumps in the number of rows. Weeks later, when a similar problem reappeared at the source, the uniqueness test failed immediately and blocked the load. The problem was solved the same day, before it even reached a report.

The cost of setting up those few tests paid for itself completely the first time they caught an error that would otherwise have contaminated decisions again. The company learned that trust in data is not assumed; it is verified.
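To see why duplicated transactions are so hard to spot by eye, consider a small worked example. The transaction ids and amounts below are invented for illustration and are not figures from the case; the point is only that one repeated row nudges the total up by a plausible-looking amount, while a uniqueness check on the key finds it immediately.

```python
from collections import Counter

# Hypothetical transactions; T-1002 was loaded twice by the faulty source.
transactions = [
    ("T-1001", 500.0),
    ("T-1002", 20.0),
    ("T-1002", 20.0),
    ("T-1003", 480.0),
]

# What the report shows: every loaded row is summed, duplicates included.
reported_total = sum(amount for _, amount in transactions)

# What it should show: one amount per transaction id.
true_total = sum(dict(transactions).values())

inflation_pct = (reported_total - true_total) / true_total * 100

# The uniqueness test: any id that appears more than once.
id_counts = Counter(tid for tid, _ in transactions)
duplicated = [tid for tid, n in id_counts.items() if n > 1]

print(f"reported={reported_total} true={true_total} inflation={inflation_pct:.1f}%")
print(f"duplicated ids: {duplicated}")
```

A total that is a few percent too high looks like a good period, not like an error, which is why the check has to be on the key itself rather than on whether the number seems reasonable.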

Trust earned, not assumed

At heart, data tests are an expression of a simple truth: trust in data is not a natural state; it is something earned and maintained with work. A data platform without tests asks people to trust it on faith; one with tests earns that trust through proof, continuously demonstrating that what enters is verified. And since trust is the most valuable and most fragile asset of any data platform, the tests that sustain it are among the highest-return investments in all of data engineering.

Adopting this discipline is, more than a technical matter, a cultural change: moving from "we hope the data is fine" to "we know it is fine, because we verified it". That shift in posture is what separates organizations where data is truly a reliable foundation from those where it is a permanent source of doubt.

In practice

If your company loads and uses data without anything systematically checking whether it is correct, you have a blind spot that will cost you dearly sooner or later. Start small, with what matters most: pick the most critical tables and add half a dozen high-value tests, such as uniqueness of keys, mandatory fields filled, and values within a reasonable range. Those few tests will catch most of the errors that go unnoticed today. Is the data that feeds your most important decisions being verified before you use it, or are you simply trusting that it is fine?