Data observability: monitoring data health the way you monitor software

In modern software systems, no one waits for a server to fail to discover there was a problem. There are tools that continuously monitor the health of everything — the memory, the response speed, the errors — and alert at the first anomaly, often before users even notice. This practice is called observability, and it is one of the reasons the big online systems run so reliably. In the world of data, however, this culture is still rare: most companies only discover their data has a problem when someone notices a wrong number in a report — that is, when it is already too late. Data observability brings to the world of data the same continuous monitoring discipline software already has.

The idea is simple to state: instead of waiting for data problems to manifest in wrong reports and bad decisions, you continuously monitor the health of the data to detect them as soon as they arise. Just as a software observability system watches the health of applications, a data observability system watches the health of data — whether it arrived on time, whether it has the expected volume, whether its values make sense, whether anything changed suspiciously. It is a fundamental shift in posture: from reactive to proactive.

This article explains what data observability is, how it differs from data tests, and why it is becoming an essential piece of any data operation that takes itself seriously.

The problem of discovering errors late

Without observability, data problems are discovered in the worst way: someone, typically a decision-maker, notices a number that does not make sense and raises the issue. By then, the problem has existed for a while, perhaps days or weeks, and has already contaminated reports and influenced decisions. Worse still, finding the cause after the problem has spread is painful detective work, because you have to retrace the whole data journey to find where something went wrong.

This reactive model has a cost that goes far beyond the error itself. Every time a bad number reaches a report, trust in all the data is undermined — people start doubting everything, even what is correct. And the data team lives in a permanent defensive position, firefighting problems it discovers late, instead of proactively ensuring data health. The absence of observability is not just inefficient; it is a constant source of trust erosion.

What data observability watches

Data observability rests on the continuous monitoring of several signals that, together, indicate the health of the data. Each signal is like a sensor that detects one category of problem, and watched together they give a broad picture of whether something is going wrong.

Freshness: did the data arrive on time? A table that should refresh overnight and did not is one of the most important alarm signals.
Volume: is the number of records what is expected? A table that suddenly has half the rows, or double, signals a problem at the source.
Value distribution: are the values still within what is normal, or has something changed suspiciously, suggesting an error?
Schema: has the data structure changed — a column that disappeared, a type that changed — without anyone warning?
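
The four signals above can be sketched as simple checks. What follows is a minimal illustration in Python, not a description of any particular tool: the TableSnapshot class, the function names, and the thresholds (a 30% volume tolerance, a z-score limit of 3) are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class TableSnapshot:
    """Hypothetical summary of a table at one point in time."""
    last_loaded_at: datetime
    row_count: int
    columns: dict   # column name -> type name
    values: list    # sample of one numeric column


def is_fresh(snap, now, max_age):
    """Freshness: the last load is no older than max_age."""
    return now - snap.last_loaded_at <= max_age


def volume_ok(snap, expected_rows, tolerance=0.3):
    """Volume: row count is within +/- tolerance of the expected count."""
    return abs(snap.row_count - expected_rows) <= tolerance * expected_rows


def distribution_ok(snap, baseline_mean, baseline_std, max_z=3.0):
    """Value distribution: the sample mean stays within max_z baseline
    standard deviations of the baseline mean."""
    if not snap.values:
        return False  # an empty sample is itself suspicious
    if baseline_std == 0:
        return mean(snap.values) == baseline_mean
    return abs(mean(snap.values) - baseline_mean) / baseline_std <= max_z


def schema_ok(snap, expected_columns):
    """Schema: column names and types match the expected ones exactly."""
    return snap.columns == expected_columns


def run_checks(snap, now, max_age, expected_rows,
               baseline_mean, baseline_std, expected_columns):
    """Run all four checks and return a mapping of signal name -> passed."""
    return {
        "freshness": is_fresh(snap, now, max_age),
        "volume": volume_ok(snap, expected_rows),
        "distribution": distribution_ok(snap, baseline_mean, baseline_std),
        "schema": schema_ok(snap, expected_columns),
    }


if __name__ == "__main__":
    # Illustrative values only: a table loaded four hours ago with
    # roughly half the expected rows.
    expected = {"order_id": "int", "amount": "float"}
    snap = TableSnapshot(
        last_loaded_at=datetime(2024, 1, 2, 2, 0),
        row_count=480,
        columns=dict(expected),
        values=[19.0, 21.0, 20.0],
    )
    results = run_checks(snap, datetime(2024, 1, 2, 6, 0),
                         timedelta(hours=12), 1000, 20.0, 2.0, expected)
    print(results)
```

In this sketch the expected row count and the baseline statistics are passed in by hand. In practice they would more likely be derived from the table's own recent history, so that "normal" is learned rather than declared.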

Observability is not the same as data tests

It is natural to confuse data observability with data tests, because both serve to ensure quality — but there is an important difference. Data tests verify specific rules we explicitly define: we know this column should have no duplicates, and we create a test to confirm it. They are excellent for catching the problems we can anticipate. Observability, on the other hand, continuously monitors the general behavior of the data and detects anomalies — deviations from what is normal — even for problems we never anticipated.

This distinction matters because the two complement each other. Tests protect against known problems, those whose rules we can write upfront. Observability protects against the unknown ones, the problems we did not foresee but that manifest as strange data behavior — a table that did not refresh, a volume that dropped, a distribution that changed. Together, they cover both the risks we anticipate and the ones that would catch us by surprise. Having only tests leaves an organization exposed to everything it did not think to test.
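
The contrast can be made concrete with a small sketch. The first function below is a data test, an explicit rule written in advance. The second is an observability-style check that only asks whether today's value is unusual compared with recent history. Both function names and the z-score threshold of 3 are assumptions chosen for illustration, and a z-score is only one of many possible ways to define "unusual".

```python
from statistics import mean, stdev


def has_no_duplicates(ids):
    """Data test: an explicit rule defined upfront (no duplicate keys)."""
    return len(ids) == len(set(ids))


def is_anomalous(history, today, max_z=3.0):
    """Observability-style check: flag today's value if it deviates
    strongly from recent history, without any rule about what the
    'right' value should be."""
    if len(history) < 2:
        return False  # not enough history to judge
    sigma = stdev(history)
    if sigma == 0:
        return today != history[0]
    return abs(today - mean(history)) / sigma > max_z


if __name__ == "__main__":
    # Illustrative daily row counts for one table.
    daily_rows = [1000, 1020, 990, 1010, 1005, 995, 1015]
    print(has_no_duplicates([1, 2, 3, 3]))   # rule violated
    print(is_anomalous(daily_rows, 500))     # sudden drop in volume
    print(is_anomalous(daily_rows, 1008))    # ordinary day
```

Nobody had to write a rule saying "the table must have about a thousand rows" for the drop to 500 to be flagged. The check infers what is normal from the history, which is what lets it catch problems no one thought to test for.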

The advantage of detecting early

All the value of observability rests on a simple idea: detecting a data problem as soon as it arises is far better than discovering it after it has spread. A problem caught at the source, before contaminating reports, is cheap to fix and easy to diagnose, because you know exactly what failed and where. The same problem discovered weeks later, in a report, has already cost wrong decisions and undermined trust, and diagnosing its origin is a nightmare. Observability turns the costly and embarrassing into the cheap and discreet.

There is also an effect on the relationship between the data team and the rest of the organization. When it is the data team itself that detects and communicates the problems — "we noticed this table did not refresh and we are fixing it" — instead of waiting for a decision-maker to discover them in a report, the dynamic changes completely. The team goes from a defensive position, always justifying errors discovered by others, to a position of control and confidence, continuously demonstrating that it is watching the health of the data. Observability not only reduces problems but transforms the perception of data reliability.

A culture, not just a tool

As with many good data practices, observability is as much a matter of culture as of technology. Having the tools that monitor data health is the start, but the full value is only realized when the organization adopts a proactive posture: when alerts are taken seriously and acted on quickly, and when the team sees watching data health as part of its normal work, not as an extra task. An observability tool whose alerts no one acts on protects no one.

This cultural change — from reacting to problems discovered by others to proactively watching the health of one's own data — is one of the transitions that mark the maturity of a data operation. It is the same evolution software systems made years ago, when they realized that waiting for something to fail was an unsustainable model, and started monitoring everything continuously. The world of data is now walking that same path.

A concrete case

A company depended on several reports fed by data pipelines that ran overnight, integrating data from several sources. For a long time, the data team lived in a reactive and stressful cycle: from time to time, one of the pipelines failed or a source did not deliver the data correctly, but no one noticed until morning, when a user opened a report and saw that the numbers were wrong or that part of the data was missing. At that point the team went into crisis mode. It had to quickly find which pipeline had failed and why, fix it, and reprocess, all while users waited and trust in the data took another blow.

The company decided to implement data observability. The team started continuously monitoring the health of the most important tables: whether each had refreshed on time, whether it had the expected volume, whether the values were within the normal range. When something fell outside expectations, the team received an immediate alert.

The transformation was profound. The team stopped discovering problems through users and started discovering them as soon as they happened, often in the middle of the night, before any report was affected. A table that did not refresh generated an alert at three in the morning, and the problem was solved before the start of the workday, without any user ever seeing wrong data. The team left permanent crisis mode and entered a calm mode of control. And, perhaps most importantly, the organization's trust in the data rose noticeably, because the visible incidents (the wrong numbers that appeared in reports) practically disappeared. The value came not from the data becoming perfect, but from the problems being caught before they did harm.
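
The kind of monitoring described in this case can be sketched as a small routine that runs after the overnight pipelines and reports any table that is stale or too small. This is only an illustration under assumed names: collect_alerts, the table names, and the thresholds are hypothetical. The sketch returns messages rather than sending them, since the delivery channel depends on the team.

```python
from datetime import datetime, timedelta


def collect_alerts(observed, expectations, now):
    """Compare each monitored table against its expectations.

    observed:     table name -> (last_loaded_at, row_count)
    expectations: table name -> (max_age, min_rows)
    Returns a list of human-readable alert messages.
    """
    alerts = []
    for table, (max_age, min_rows) in expectations.items():
        if table not in observed:
            alerts.append(f"{table}: no load recorded")
            continue
        last_loaded_at, row_count = observed[table]
        if now - last_loaded_at > max_age:
            alerts.append(
                f"{table}: stale, last loaded {last_loaded_at:%Y-%m-%d %H:%M}"
            )
        if row_count < min_rows:
            alerts.append(
                f"{table}: only {row_count} rows, expected at least {min_rows}"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical state at three in the morning: "orders" was last
    # loaded the previous night, "customers" refreshed two hours ago.
    now = datetime(2024, 1, 2, 3, 0)
    expectations = {
        "orders": (timedelta(hours=4), 100),
        "customers": (timedelta(hours=4), 10),
    }
    observed = {
        "orders": (datetime(2024, 1, 1, 2, 0), 500),
        "customers": (datetime(2024, 1, 2, 1, 0), 50),
    }
    for message in collect_alerts(observed, expectations, now):
        print(message)
```

Keeping the expectations in one place per table makes it easy to start with a handful of critical tables and extend the list over time.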

In practice

If data problems in your company are discovered when a user notices a wrong number in a report, you are in a reactive model that costs dearly in trust and stress. Data observability offers a proactive alternative: continuously monitoring the health of the data (freshness, volume, values, and schema) to detect problems as soon as they arise, before they contaminate reports and decisions. Start with the most critical tables, the ones that feed the most important reports. Are your company's data problems caught by your team before they do harm, or discovered by users when it is already too late?