(+351) 21 24 10006  ·  info@bconcepts.pt
Carnaxide, Lisbon
ETL

Incremental load in ETL: load only new or changed rows

João Barros 04 de July de 2026 4 min read

Reloading the whole table on every run is fine when there is little data, but it becomes slow and expensive as the source grows — and it forces you to reprocess millions of rows that never changed. An incremental load in an ETL solves exactly this: it processes only the rows that are new or changed since the last run, saving time, compute and money. The simplest and most reliable pattern for this uses a watermark column, and that is what you will build next, step by step.

Prerequisites

  • A source table with an always-increasing column — a modification timestamp (for example data_modificacao) or an incremental ID.
  • A destination table in your data warehouse (SQL Server, PostgreSQL or another).
  • Permission to create a small control table.
  • Basic SQL knowledge (INSERT, UPDATE, MERGE).

Step 1: Choose the watermark column

The watermark is the column that tells you how far you have already loaded. To work without gaps, it must always grow and never be reused or moved backwards. The two most common choices are a modification date/time column (like data_modificacao) or an incremental identifier generated by the database. A good watermark avoids two classic problems: skipping records (gaps) and loading rows that were already processed (duplicates). In the example, the source is the table vendas(id, cliente, total, data_modificacao) and the watermark is data_modificacao.

Incremental load in ETL: load only new or changed rows

Step 2: Create the control table

You need to store, between runs, the last value already processed. A small control table is enough for that. The first time, initialize it with an old date so that the initial load brings in every row.

CREATE TABLE etl_control (
    tabela          VARCHAR(100) PRIMARY KEY,
    last_watermark  DATETIME2 NOT NULL
);

INSERT INTO etl_control (tabela, last_watermark)
VALUES ('vendas', '1900-01-01');

Step 3: Read the watermark and capture the new one

At the start of each run, read the last processed value and also store the current maximum from the source. Capturing the maximum before loading avoids losing rows that arrive during the process.

DECLARE @last_watermark DATETIME2 =
    (SELECT last_watermark FROM etl_control WHERE tabela = 'vendas');

DECLARE @new_watermark DATETIME2 =
    (SELECT MAX(data_modificacao) FROM vendas);

Step 4: Load only the new data with MERGE

Now select only the rows whose data_modificacao is greater than the last watermark and apply them to the destination. MERGE performs an upsert in a single statement: it updates the rows that already exist and inserts the new ones, matching on the key. Doing it all in one command is more efficient and keeps the logic clearer than splitting into manual UPDATE and INSERT.

MERGE INTO dw_vendas AS destino
USING (
    SELECT id, cliente, total, data_modificacao
    FROM vendas
    WHERE data_modificacao > @last_watermark
) AS origem
ON destino.id = origem.id
WHEN MATCHED THEN
    UPDATE SET destino.cliente = origem.cliente,
               destino.total   = origem.total,
               destino.data_modificacao = origem.data_modificacao
WHEN NOT MATCHED THEN
    INSERT (id, cliente, total, data_modificacao)
    VALUES (origem.id, origem.cliente, origem.total, origem.data_modificacao);
Tip: always use > (greater than) and not >= in the watermark comparison, so you do not reprocess the last row already loaded.

Step 5: Update the watermark

After the load finishes successfully, save the new value in the control table. On the next run, only records more recent than this moment will be processed.

UPDATE etl_control
SET last_watermark = @new_watermark
WHERE tabela = 'vendas';

Verify the result

The best way to confirm that the incremental load works is to run the pipeline twice in a row. On the first run it brings in every row; on the second, with no new data in the source, it should process zero rows. To test thoroughly, change or insert a single record in the source and run again: only that row should reach the destination. You can check the count with a simple query:

SELECT COUNT(id) AS total_destino FROM dw_vendas;

If the number grows by exactly what you expect on each run, the pattern is working as it should.

Conclusion

With a watermark column, a control table and a MERGE, you have turned a heavy full load into a fast, cheap incremental load. From here, you can wrap the steps in a transaction to guarantee consistency, log the number of processed rows for auditing, and handle source deletions with a soft delete flag or with Change Data Capture (CDC). Which column in your source will you use as the watermark — a modification date or an incremental ID?

Share: