(+351) 21 24 10006  ·  info@bconcepts.pt
Carnaxide, Lisbon
ETL

Idempotency in ETL: re-run safely without duplicating data

João Barros 04 de July de 2026 5 min read

An idempotent ETL pipeline is one you can run two, five or ten times in a row and always get the same final result, with no duplicate rows and no corrupted data. This property is what lets you sleep at night: when a process fails halfway and has to be reprocessed, idempotency guarantees that the second attempt does not ruin what the first one already loaded. In production, network failures, timeouts and restarts happen, and an idempotent pipeline turns those scares into a simple re-run. Below you will see, step by step, how to turn a "blind append" Load into an idempotent Load using a business key and a MERGE.

Prerequisites

  • A relational target (for example SQL Server, Azure SQL or PostgreSQL) where you create the final table.
  • Basic SQL: INSERT, UPDATE and, ideally, MERGE.
  • A source with a stable identifier per record (for example, the sale id).

Step 1: Understand why a simple Load is not idempotent

The most common pattern in the Load phase is a direct INSERT from staging into the target. The problem is that every run appends rows: if the pipeline fails after loading half of the sales and you run it again, the rows already loaded come in a second time and you end up with duplicated records. Picture a billing report counting the same sale twice: the error spreads across the whole analysis. An idempotent Load depends only on what is in the source, not on how many times it ran.

Idempotency in ETL: re-run safely without duplicating data
INSERT INTO vendas_dest (id_venda, cliente, total)
SELECT id_venda, cliente, total
FROM vendas_stg;

Step 2: Choose a stable business key

The business key is the column (or set of columns) that identifies a record uniquely and does not change over time, such as an id_venda. This key is what lets you decide, for each row, whether to insert a new record or update an existing one, instead of always inserting. If you need more than one column to guarantee uniqueness, use a composite key.

Step 3: Replace the INSERT with a MERGE (upsert)

MERGE compares the source with the target by the key: if the row already exists, it does an UPDATE; if it does not exist, it does an INSERT. Because it is a single atomic statement, it either runs completely or changes nothing. Running the same MERGE twice with the same source produces exactly the same final state, which is the practical definition of idempotency.

MERGE INTO vendas_dest AS d
USING vendas_stg AS s
   ON d.id_venda = s.id_venda
WHEN MATCHED THEN
   UPDATE SET d.cliente = s.cliente,
              d.total   = s.total
WHEN NOT MATCHED THEN
   INSERT (id_venda, cliente, total)
   VALUES (s.id_venda, s.cliente, s.total);
Tip: in PostgreSQL, the equivalent is INSERT ... ON CONFLICT ... DO UPDATE; on engines without MERGE, the delete-insert pattern from Step 5 always works.

Step 4: Make sure the source has no duplicates

MERGE fails if the source has the same key in several rows, because it cannot tell which one to use. So clean the duplicates in staging before the MERGE, keeping only the most recent version of each key with ROW_NUMBER().

WITH ranked AS (
   SELECT id_venda, cliente, total,
          ROW_NUMBER() OVER (
             PARTITION BY id_venda
             ORDER BY data_atualizacao DESC
          ) AS rn
   FROM vendas_raw
)
INSERT INTO vendas_stg (id_venda, cliente, total)
SELECT id_venda, cliente, total
FROM ranked
WHERE rn = 1;

Step 5: Partition load, an idempotent alternative

When you reprocess a whole slice of data (for example, one day), there is an even simpler pattern: delete that partition in the target and insert it again. Inside a transaction, the operation is atomic and can be repeated with no risk.

BEGIN TRANSACTION;

DELETE FROM vendas_dest
WHERE data_venda = '2026-07-04';

INSERT INTO vendas_dest (id_venda, cliente, total, data_venda)
SELECT id_venda, cliente, total, data_venda
FROM vendas_stg
WHERE data_venda = '2026-07-04';

COMMIT;

Verify the result

The real proof is simple: run the pipeline twice in a row and confirm that the row count and the totals do not change from the first run to the second. Then look for repeated keys in the target; if this query returns nothing, the Load is idempotent.

SELECT id_venda, COUNT(*) AS n
FROM vendas_dest
GROUP BY id_venda
HAVING COUNT(*) > 1;

Conclusion

With a business key, a MERGE (or a delete-insert per partition) and a duplicate cleanup in staging, your Load becomes safe to repeat as many times as you need. From here, wrap everything in a transaction, add logging, and combine idempotency with incremental loading for fast, reliable pipelines. Next time a pipeline fails halfway, ask yourself one question: can I simply run it again without fear? If the answer is yes, you already have an idempotent ETL.

Share: