Idempotency: pipelines you can run twice without breaking data

Sooner or later, every data engineer lives through the same nightmare: a pipeline failed in the middle of the night, someone re-ran it to recover, and in the morning the reports show sales doubled. Sales did not double; the data did, because the pipeline ran twice. This problem has a name and a solution: idempotency, perhaps the most important and least-discussed property of a reliable pipeline.

What idempotency is

A process is idempotent when running it once or several times produces exactly the same result. Loading a day's data should leave the database in the same state whether you run that load once or five times by mistake. If the second load duplicates the data, the pipeline is not idempotent — and it is a time bomb waiting for the day someone has to repeat it.
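To make the failure concrete, here is a minimal sketch of a load that is not idempotent. The in-memory list standing in for a table, the `append_load` helper and the sample rows are all hypothetical, chosen only for illustration.

```python
# Hypothetical in-memory "table": a list of (day, amount) rows.
def append_load(table, day_rows):
    """Non-idempotent load: every run adds the same rows again."""
    table.extend(day_rows)


def total(table):
    """Sum of all amounts currently in the table."""
    return sum(amount for _, amount in table)


day_rows = [("2024-01-01", 100), ("2024-01-01", 50)]

table = []
append_load(table, day_rows)
once = total(table)

append_load(table, day_rows)  # the accidental re-run
twice = total(table)

print(once, twice)  # 150 300
```

The second run changes the result, so the load fails the definition above: an idempotent version would report 150 both times.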

Why this matters more than it seems

In a perfect world, every pipeline would run once and never fail. In the real world, pipelines fail (a connection drops, a source is late, a server restarts) and the natural response is to run them again. If the pipeline is not idempotent, each recovery creates a new problem. And since failures tend to happen at the worst moments, the duplication is often discovered late, after it has already contaminated reports and decisions. Idempotency is the safety net that makes recovery safe instead of dangerous.

The most common pattern: delete and reinsert

The simplest and most robust way to ensure idempotency is the "delete-insert" pattern by partition. Instead of blindly appending the new data, the pipeline first deletes the data for the period it is about to load (for example, today's) and only then inserts it. This way, running today's load twice always leaves exactly one set of today's data — the second load deletes what the first put in and puts back the same. Simple, predictable, repeat-proof.
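The pattern can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not a reference implementation: the `sales` table, its `day` and `amount` columns and the `load_day` helper are hypothetical, and a real warehouse would use its own partitioning mechanism.

```python
import sqlite3


def load_day(conn, day, amounts):
    """Delete-insert by partition: replace every row for `day`."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO sales (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")

load_day(conn, "2024-01-01", [100.0, 50.0])
load_day(conn, "2024-01-01", [100.0, 50.0])  # re-run after a "failure"

count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM sales"
).fetchone()
print(count, total)  # 2 150.0
```

The delete is scoped by the same `day` value that the insert writes, so the re-run replaces exactly the rows it is about to load and nothing else.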

The alternative: upsert by key

When there is no clean partition by period, you use "upsert": for each record, if it already exists (identified by a unique key), it is updated; if it does not exist, it is inserted. The result is the same — running twice does not duplicate, because the second pass merely overwrites what was already there. It requires reliable keys, but it gives idempotency even in flows where data arrives mixed.
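A minimal upsert sketch, again with `sqlite3`. The `orders` table, the `order_id` key and the `upsert_orders` helper are hypothetical. The `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or later; other databases have their own upsert syntax (`MERGE`, for example).

```python
import sqlite3


def upsert_orders(conn, orders):
    """Insert new orders; overwrite the amount of orders that already exist."""
    with conn:  # commit on success, roll back on error
        conn.executemany(
            """
            INSERT INTO orders (order_id, amount) VALUES (?, ?)
            ON CONFLICT(order_id) DO UPDATE SET amount = excluded.amount
            """,
            orders,
        )


conn = sqlite3.connect(":memory:")
# The PRIMARY KEY is the "reliable key" the upsert depends on.
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount REAL)")

batch = [("A-1", 100.0), ("A-2", 50.0)]
upsert_orders(conn, batch)
upsert_orders(conn, batch)            # re-run: overwrites, does not duplicate
upsert_orders(conn, [("A-2", 55.0)])  # a later correction to an existing key

result = conn.execute(
    "SELECT order_id, amount FROM orders ORDER BY order_id"
).fetchall()
print(result)  # [('A-1', 100.0), ('A-2', 55.0)]
```

Without the unique constraint on `order_id` the database has nothing to conflict on, and the same statement would silently behave like a plain append.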

Precautions that make the difference

Define the right granularity: the partition to delete and reinsert must match what the pipeline processes each time. Delete too much and you lose data the run will not put back; delete too little and rows left by an earlier run survive next to the new ones.
Make the operation atomic: the delete and insert should happen as a whole, so a failure midway does not leave the database in an inconsistent state.
Think about downstream data: if other processes read this data, the replacement must be invisible to them — they cannot catch the moment when the data is deleted but not yet reinserted.
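The atomicity precaution can be illustrated by simulating a crash between the delete and the insert. This is a sketch: the `sales` table, the `replace_day` helper and its `fail_midway` flag are hypothetical and exist only to trigger the failure. It relies on `sqlite3` rolling back the open transaction when the `with conn:` block exits with an exception.

```python
import sqlite3


def replace_day(conn, day, amounts, fail_midway=False):
    """Delete and reinsert one day inside a single transaction."""
    with conn:  # rolls back automatically if the block raises
        conn.execute("DELETE FROM sales WHERE day = ?", (day,))
        if fail_midway:
            raise RuntimeError("simulated crash between delete and insert")
        conn.executemany(
            "INSERT INTO sales (day, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
replace_day(conn, "2024-01-01", [100.0, 50.0])

try:
    replace_day(conn, "2024-01-01", [100.0, 50.0], fail_midway=True)
except RuntimeError:
    pass  # the failed run is simply retried later

remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(remaining)  # 2
```

The delete was undone along with the failed run, so the table still holds the previous day's rows. Whether downstream readers on other connections are equally shielded from the intermediate state depends on the database's isolation guarantees, which this single-connection sketch does not exercise.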

A concrete case

A company had a nightly pipeline that appended the day's sales to the historical table. It worked for months — until the night the source was late, the pipeline failed midway, and the operations team, following procedure, re-ran it. By dawn, the previous day's sales appeared duplicated, and since the duplication was not obvious (the totals only looked like "a good day"), it took a while for someone to get suspicious. The fix involved identifying and deleting the duplicate records by hand, with the risk of deleting the wrong ones. After this scare, they rewrote the pipeline with the delete-insert pattern by day. The next time it failed and had to be repeated, absolutely nothing happened — the data stayed correct, with no intervention. The cost of making the pipeline idempotent paid for itself on the first calm recovery.

Idempotency is a mindset, not a trick

The real value is not in a specific technique, but in the habit of designing every pipeline thinking "what if this runs twice?". That question, asked at the start, changes the architecture for the better and saves sleepless nights later. Idempotent pipelines recover on their own, tolerate failures without drama and give those who operate them the peace of mind to repeat without fear. It is the difference between a fragile system and a robust one.

In practice

Do a simple exercise with your most critical pipeline: if you ran it now, again, over data it has already processed, what would happen? If the answer is "it would duplicate" or "I do not know", you have found your next improvement — and probably avoided a future nightmare. Do your pipelines survive being run twice, or do they live waiting for the day someone has to repeat them?
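One way to run that exercise safely is against a scratch database rather than production: run the load twice and compare snapshots. The sketch below assumes a hypothetical `sales` table and two hypothetical load functions; `is_idempotent` is an illustrative helper, not part of any library.

```python
import sqlite3

ROWS = [("2024-01-01", 100.0), ("2024-01-01", 50.0)]


def make_conn():
    """Fresh scratch database with an empty sales table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")
    return conn


def snapshot(conn, table):
    """Sorted copy of the table. The name is interpolated: trusted input only."""
    return sorted(conn.execute(f"SELECT * FROM {table}").fetchall())


def is_idempotent(conn, table, load):
    """Run `load` twice and report whether the second run changed anything."""
    load(conn)
    first = snapshot(conn, table)
    load(conn)
    return snapshot(conn, table) == first


def append_load(conn):
    with conn:
        conn.executemany("INSERT INTO sales VALUES (?, ?)", ROWS)


def delete_insert_load(conn):
    with conn:
        conn.execute("DELETE FROM sales WHERE day = ?", ("2024-01-01",))
        conn.executemany("INSERT INTO sales VALUES (?, ?)", ROWS)


append_ok = is_idempotent(make_conn(), "sales", append_load)
replace_ok = is_idempotent(make_conn(), "sales", delete_insert_load)
print(append_ok, replace_ok)  # False True
```

A check like this fits naturally into a test suite, so the question "what if this runs twice?" gets answered on every change instead of during an incident.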