Schema evolution: how to change the data structure without breaking everything

A company's data is never still. The business changes, and with it change the data that describes it: a new type of information about customers is added, a field that no longer makes sense stops being used, the way a transaction is recorded changes. Each of these changes in the structure of the data — called the schema — is natural and inevitable. The problem is that, if not well managed, a seemingly innocent change to a table's structure can break everything that depends on it downstream: pipelines that stop running, reports that go blank, analyses that fail. Managing these changes in a controlled way — called schema evolution — is one of the skills that distinguish a robust data operation from a fragile one.

A tension sits at the heart of the problem: the data needs to evolve to keep up with the business, but that evolution threatens the stability of everything that consumes it. A frozen data structure that never changes would be stable but would quickly become misaligned with reality. A structure that changes carelessly would be flexible but would constantly break what depends on it. Schema evolution is the art of reconciling the two: letting data change without those changes causing damage.

This article explains why schema changes are so dangerous, what types of change exist, and how to manage them so that data can evolve safely.

Why an innocent change breaks everything

A data table rarely lives in isolation. Typically, it is consumed by many things downstream: pipelines that read and transform it, reports that show it, models that use it, other tables that derive from it. Each of these consumers was built assuming the table has a certain structure — certain fields, with certain types. When that structure changes, those consumers can break, because the assumption they rested on is no longer true.

What makes this particularly dangerous is that the dependencies are invisible. Whoever changes a table rarely knows everything that depends on it: all the pipelines, reports and analyses that consume it, often built by other people over time. So a change made with the best of intentions, to improve the data structure, can silently break a critical report on the other side of the company without the person who made the change ever finding out. It is this chain of invisible dependencies that turns a seemingly simple change into a real risk.

Not all changes are equal

A fundamental distinction in schema evolution is between safe and dangerous changes. Some changes are compatible — they do not break what already exists. Adding a new field to a table, for example, is generally safe: consumers that do not know about the new field simply ignore it and keep working as before. These additive changes can be made with relative calm, because they do not take away anything someone might have been relying on.

Other changes are incompatible — they break what depends on them. Removing a field, changing its type, or changing its meaning are dangerous changes, because any consumer that depended on that field, as it was, stops working. Distinguishing these two types of change is the first step to managing schema evolution: compatible changes can be made with light care; incompatible ones require a much more careful process, because they will, by definition, break something if not managed.
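
The distinction between the two kinds of change can be made mechanical. The sketch below is only an illustration, assuming a schema can be represented as a plain mapping from field name to type name; the function name classify_changes and the example fields are hypothetical. It compares an old and a new schema and sorts the differences into the two groups described above. Note that a change of meaning leaves no trace in the structure, so a check like this cannot see it.

```python
def classify_changes(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Sort the differences between two schemas into compatible and incompatible.

    Assumption: a schema is a mapping of field name -> type name.
    """
    compatible: list[str] = []
    incompatible: list[str] = []

    # Added fields are generally safe: consumers that do not know them ignore them.
    for name in sorted(new.keys() - old.keys()):
        compatible.append(f"added field '{name}' ({new[name]})")

    # Removed fields break any consumer that still reads them.
    for name in sorted(old.keys() - new.keys()):
        incompatible.append(f"removed field '{name}'")

    # A type change on an existing field breaks consumers that assumed the old type.
    for name in sorted(old.keys() & new.keys()):
        if old[name] != new[name]:
            incompatible.append(
                f"changed type of '{name}': {old[name]} -> {new[name]}"
            )

    return {"compatible": compatible, "incompatible": incompatible}


# Hypothetical customer table, before and after a proposed change.
old_schema = {"id": "int", "email": "string", "fax": "string"}
new_schema = {"id": "int", "email": "string", "signup_date": "date"}

for kind, changes in classify_changes(old_schema, new_schema).items():
    for change in changes:
        print(f"{kind}: {change}")
```

A report like this does not decide anything by itself, but it tells whoever is making the change which parts can go ahead with light care and which ones need the more careful process.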

The principles of safe evolution

Prefer additive changes: add instead of altering or removing whenever possible, because adding rarely breaks what exists.
Deprecate before removing: mark a field as obsolete and give consumers time to adapt before actually removing it.
Communicate the changes: warn those who depend on the data in advance, so they can prepare instead of being caught by surprise.
Validate automatically: put checks in place that detect dangerous schema changes before they reach production and break things.
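
As one possible shape for the automatic validation principle, the sketch below assumes each consumer declares the fields and types it reads, and that a proposed schema is checked against those declarations before it is deployed. The function find_broken_consumers, the consumer names and the field names are all hypothetical; a real setup would load the declarations from wherever the teams keep them.

```python
def find_broken_consumers(
    proposed_schema: dict[str, str],
    consumer_requirements: dict[str, dict[str, str]],
) -> dict[str, list[str]]:
    """Return, per consumer, the problems the proposed schema would cause.

    Assumption: schemas and requirements map field name -> type name.
    Consumers with no problems are left out of the result.
    """
    broken: dict[str, list[str]] = {}
    for consumer, required in consumer_requirements.items():
        problems = []
        for field, expected_type in required.items():
            if field not in proposed_schema:
                problems.append(f"missing field '{field}'")
            elif proposed_schema[field] != expected_type:
                problems.append(
                    f"'{field}' is {proposed_schema[field]}, expected {expected_type}"
                )
        if problems:
            broken[consumer] = problems
    return broken


# Hypothetical declarations: what each downstream consumer reads.
requirements = {
    "churn_report": {"id": "int", "fax": "string"},
    "billing_pipeline": {"id": "int", "email": "string"},
}

# Proposed schema after someone removes the 'fax' field.
proposed = {"id": "int", "email": "string"}

broken = find_broken_consumers(proposed, requirements)
for consumer, problems in broken.items():
    print(f"{consumer}: {', '.join(problems)}")

# A deployment step could refuse to proceed whenever `broken` is non-empty.
```

Because the declarations come from the consumers themselves, a check of this kind also makes the otherwise invisible dependencies visible to whoever proposes the change.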

The technique of deprecating before removing

One of the most valuable techniques in schema evolution is never removing something abruptly, but rather deprecating it first. When a field is no longer needed and you want to remove it, instead of deleting it right away — which would break everything that depended on it — you mark it as obsolete and communicate that it will be removed within a certain deadline. During that period, the field still exists, but everyone knows they should stop using it. Only after giving enough time for everyone to adapt is the field actually removed.

This approach turns a dangerous and abrupt change into a controlled and predictable transition. Instead of breaking everything at once and forcing emergency fixes, you give consumers the chance to adapt at their own pace, with advance notice. It is the difference between demolishing a building with people inside and giving them time to leave first. This patience has a cost — keeping the obsolete field for a while longer — but it avoids the chaos and incidents an abrupt removal would cause.
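
The deprecate-then-remove cycle described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed implementation: the Field class, the deprecate and can_remove functions, and the 60-day default notice period are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Field:
    """A schema field with optional deprecation metadata (hypothetical model)."""
    name: str
    type: str
    deprecated_on: Optional[date] = None
    remove_after: Optional[date] = None


def deprecate(field: Field, today: date, notice_days: int = 60) -> Field:
    """Mark a field as obsolete and set the earliest date it may be removed.

    The 60-day default is an assumption; pick whatever notice consumers need.
    """
    return replace(
        field,
        deprecated_on=today,
        remove_after=today + timedelta(days=notice_days),
    )


def can_remove(field: Field, today: date) -> bool:
    """A field may be removed only if it was deprecated and the notice has elapsed."""
    return field.remove_after is not None and today >= field.remove_after


fax = Field(name="fax", type="string")
print(can_remove(fax, date(2024, 1, 1)))    # never deprecated, so removal is refused

fax = deprecate(fax, today=date(2024, 1, 1), notice_days=30)
print(can_remove(fax, date(2024, 1, 15)))   # still inside the notice period
print(can_remove(fax, date(2024, 2, 1)))    # notice period has elapsed
```

The point of encoding the deadline in the schema itself is that the removal date stops being a matter of memory or goodwill: the field carries its own notice, and tooling can refuse an early removal.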

A concrete case

A company had a central customer table with a large number of consumers: dozens of reports, several pipelines, some models, and other tables that derived from it, many built by different teams over years. At some point the business changed; one of the fields in that table stopped making sense, while another needed its format changed.

A first attempt to make these changes went badly. To tidy the table, a team removed the obsolete field and altered the format of the other directly, without realizing the extent of the dependencies. The result was a day of chaos: several reports from other areas went blank or errored, pipelines failed, and nobody immediately understood why, because the change had been made to one table and the effects showed up somewhere completely different.

After this scare, the company adopted a disciplined approach to schema evolution, with clear principles for future changes. Additive changes, such as adding new fields, were made with relative calm. For dangerous changes, like removing a field or changing its format, they adopted the deprecate-before-removing technique: they marked the field as obsolete, communicated the change to all teams well in advance, gave everyone time to adapt, and only then removed it. They also added automatic validations that detected dangerous schema changes before they reached production.

The next time a field needed to be removed, the process went without incident: the field was deprecated, teams were warned, they had weeks to adapt, and when the field was finally removed, nothing broke, because nobody depended on it anymore. The same kind of change that had previously caused a day of chaos became a calm and predictable transition. The difference was not in avoiding the change, since data has to evolve, but in managing it in a controlled way instead of doing it abruptly.

Evolving without fear

The ultimate goal of schema evolution is to let a company's data keep up with the business without that necessary evolution becoming a source of fear and incidents. A data operation that manages schema changes well can evolve its data structure with confidence, knowing it has the processes to do so without breaking anything. One that manages them badly is stuck in a dilemma: either it freezes the data and becomes misaligned with reality, or it changes it and lives in constant crisis. Good schema evolution removes this dilemma, allowing safe change.

This capability connects directly to other mature data practices — data contracts, which formalize the promises about the data structure, and observability, which detects when something changes unexpectedly. They all share the same goal: making data robust and reliable even in a constantly changing world. Schema evolution is the piece that ensures the inevitable change of the data structure happens in a controlled way, and not as a series of unpleasant surprises.

In practice

If changes to the data structure in your company are made ad hoc, and every so often a change to a table breaks reports or pipelines on the other side of the organization, you need a disciplined approach to schema evolution. Adopt simple principles: prefer adding to altering, deprecate before removing so people have time to adapt, communicate changes in advance, and automatically validate dangerous changes. Are the changes to your data structure managed in a controlled way, or do you live with the fear that the next change will break something important on the other side of the company?