Change Data Capture (CDC): capturing only what changes, without overloading sources

Imagine you have a table with ten million customers and you want to keep it synchronized in your data warehouse. The naive approach is to copy all ten million rows every night. It works — until it stops working. As the data grows, that full copy takes longer and longer, weighs more and more on the source database, and wastes resources moving millions of rows that changed nothing. There is a much better way: capture only what changed. It is called Change Data Capture, or CDC.

CDC is one of those techniques that look like a technical detail but that, in practice, decide whether your data architecture scales or chokes. Understanding the problem it solves and the ways to implement it is essential for anyone building pipelines that have to keep up with growing data — that is, practically everyone.

The full-copy problem

Reloading everything on each update is simple to program and, with little data, perfectly acceptable. The problem is that it does not scale. If out of the ten million customers only a thousand changed since yesterday, copying all ten million is doing ten thousand times more work than needed. That waste translates into update windows that stretch deep into the night, source databases overloaded at the worst moment, and processing costs that rise without value rising with them.

There is also a subtler problem: the full copy only gives you the current state, not the history. If a customer changed address three times since the last copy, you see only the last one; the intermediate changes are lost. For many analyses that does not matter; for others, the history of changes is precisely what matters. A CDC method that captures every individual change preserves the history that the full copy throws away.

The core idea: track changes, not state

CDC flips the logic. Instead of asking "how is everything now?" and copying the whole result, it asks "what changed since last time?" and moves only those changes — the inserts, updates and deletes. Since the number of changes in a period is typically a tiny fraction of the total, the volume to move is small, the update is fast, and the source barely feels the weight. It is the difference between moving house every day and just bringing what is new.
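To make the idea concrete, here is a minimal sketch of what "moving only the changes" looks like on the receiving side. The event format (operation, key, row) and the sample customers are assumptions made up for illustration, not the format of any particular tool.

```python
from typing import Any, Optional

# Hypothetical change-event format: (operation, primary key, row or None).
Change = tuple[str, int, Optional[dict[str, Any]]]


def apply_changes(target: dict[int, dict[str, Any]], changes: list[Change]) -> None:
    """Apply inserts, updates and deletes to the target copy, in order."""
    for op, key, row in changes:
        if op in ("insert", "update"):
            target[key] = row
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")


# The warehouse copy as of the last run.
warehouse = {
    1: {"name": "Ana", "city": "Lisbon"},
    2: {"name": "Bruno", "city": "Porto"},
    3: {"name": "Carla", "city": "Faro"},
}

# Only what changed since then: three events, not the whole table.
changes = [
    ("update", 2, {"name": "Bruno", "city": "Braga"}),
    ("delete", 3, None),
    ("insert", 4, {"name": "Duarte", "city": "Coimbra"}),
]

apply_changes(warehouse, changes)
```

The work done is proportional to the number of events, not to the size of the table: customer 1 is never read or rewritten.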

The ways to capture changes

By timestamp: if each row has a "last changed" field, you just ask for the ones that changed since the last run. Simple, but it depends on the column existing and always being updated — and it does not catch deletes.
By comparison: comparing the current state with a previous copy to detect differences. It needs no special columns and it does detect deletes, but it has to read the whole table on every run, which partly defeats the purpose.
From the database log: the most powerful way. Databases internally record every change in a transaction log; log-based CDC reads that record and reconstructs the changes without even touching the tables — minimal impact on the source, catches everything (including deletes), almost in real time.
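The first two approaches are easy to sketch side by side. The example below uses an in-memory SQLite table; the table name, the updated_at column and the sample rows are assumptions for illustration. It shows the timestamp query missing a delete that the snapshot comparison catches.

```python
import sqlite3

# Hypothetical source table; the column names are assumptions for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Lisbon", "2024-01-01T08:00:00"),
        (2, "Porto", "2024-01-01T08:00:00"),
        (3, "Faro", "2024-01-01T08:00:00"),
    ],
)


def changed_since(conn, watermark):
    """Timestamp capture: rows whose updated_at is later than the last run."""
    return conn.execute(
        "SELECT id, city FROM customers WHERE updated_at > ? ORDER BY id",
        (watermark,),
    ).fetchall()


def snapshot(conn):
    """Full read of the table as {id: city}; this is the expensive part."""
    return dict(conn.execute("SELECT id, city FROM customers").fetchall())


def diff_snapshots(previous, current):
    """Comparison capture: diff two full snapshots keyed by primary key."""
    inserted = sorted(current.keys() - previous.keys())
    deleted = sorted(previous.keys() - current.keys())
    updated = sorted(k for k in current.keys() & previous.keys() if current[k] != previous[k])
    return inserted, updated, deleted


before = snapshot(conn)
last_run = "2024-01-01T23:00:00"

# After the last run: one update, one insert, one delete.
conn.execute("UPDATE customers SET city = 'Braga', updated_at = '2024-01-02T10:00:00' WHERE id = 2")
conn.execute("INSERT INTO customers VALUES (4, 'Coimbra', '2024-01-02T11:00:00')")
conn.execute("DELETE FROM customers WHERE id = 3")

by_timestamp = changed_since(conn, last_run)            # sees ids 2 and 4 only
by_comparison = diff_snapshots(before, snapshot(conn))  # also sees that id 3 is gone
```

The timestamp query returns the updated and the inserted row but has no way of knowing that customer 3 was deleted, because a deleted row has no timestamp left to compare. The comparison finds all three changes, at the price of two full reads of the table.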

Why log-based CDC is the gold standard

Reading the transaction log is elegant because it leverages something the database already does anyway to guarantee its own consistency. You do not need to add columns, you do not need to overload the source with comparisons, and no change escapes: not the deletes, which timestamp-based capture misses, and not the intermediate updates, which both lighter techniques collapse into the latest state. In exchange for this power, it requires more configuration and access to the database's internal mechanisms, which makes it more complex to set up. It is the standard for large volumes and low-latency requirements, but for simple cases the lighter approaches are enough.
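The mechanics can be illustrated with a toy model. The sketch below is a simulation, not a reader for any real database's log: ChangeLog stands in for the transaction log, and Consumer, poll and checkpoint are hypothetical names for a process that remembers how far it has read and picks up from there.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Stand-in for a transaction log: an append-only list of change records."""
    entries: list = field(default_factory=list)

    def append(self, op, key, row=None):
        # The position plays the role of a log sequence number.
        self.entries.append({"pos": len(self.entries) + 1, "op": op, "key": key, "row": row})

    def read_after(self, position):
        return [e for e in self.entries if e["pos"] > position]


@dataclass
class Consumer:
    """Reads the log from its last checkpoint and applies changes to a replica."""
    replica: dict = field(default_factory=dict)
    history: list = field(default_factory=list)
    checkpoint: int = 0

    def poll(self, log):
        batch = log.read_after(self.checkpoint)
        for entry in batch:
            if entry["op"] == "delete":
                self.replica.pop(entry["key"], None)
            else:  # insert or update
                self.replica[entry["key"]] = entry["row"]
            self.history.append((entry["key"], entry["op"], entry["row"]))
            self.checkpoint = entry["pos"]
        return len(batch)


log = ChangeLog()
consumer = Consumer()

log.append("insert", 1, {"city": "Lisbon"})
log.append("update", 1, {"city": "Porto"})
first_batch = consumer.poll(log)   # reads positions 1-2

log.append("update", 1, {"city": "Braga"})
log.append("insert", 2, {"city": "Faro"})
log.append("delete", 2)
second_batch = consumer.poll(log)  # reads positions 3-5 only
```

The consumer never queries the table itself, each poll reads only the entries after its checkpoint, and the history keeps every address customer 1 ever had as well as the short life of customer 2. Real log-based tools add a great deal on top of this (log formats, schema changes, failure recovery), which is where the extra setup effort goes.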

A concrete case

A company had a nightly pipeline that copied several large tables in full to the data warehouse. At first, it ran in twenty minutes. As the data grew over two years, that window stretched to more than three hours and started colliding with the start of the workday, a time when the source database was already being used for operations and could not take the extra load. Morning reports were delayed, and the team spent its mornings firefighting an update that no longer fit in the night. The solution was not to buy bigger hardware; it was to switch to log-based CDC. Instead of copying everything, the pipeline started moving only each table's changes, which were a small fraction of the total. The update window dropped from more than three hours to a few minutes, the source database stopped feeling the weight, and the reports were once again ready well before the start of the day. The same hardware, a different technique, a problem solved for good.

When the complexity is not worth it

CDC shines with large volumes and a need for freshness. But if you have small tables that fit in a fast full copy, adding CDC is complexity with no return — the simple full copy is easier to build and maintain. Like almost everything in data engineering, the right technique depends on the problem: using CDC "because it is modern", without the scale that justifies it, is solving a problem you do not have at the cost of complicating what already worked.

In practice

If your update windows are stretching and the source suffers with ever-heavier full copies, CDC is probably the answer you are avoiding. Start by identifying the largest and most costly tables to reload, and evaluate moving only what changes in those. Most "the update no longer fits in the night" problems are solved this way. Are your large tables being copied in full every night, when only a fraction of them changes?