How to Implement SCD Type 2 in a Data Warehouse
When a customer moves to a new city or a product changes category, a well-designed Data Warehouse does not overwrite the old value: it keeps the full history, so that reports about the past stay correct. The most common technique to achieve this is called SCD Type 2 (Slowly Changing Dimension), and it is simpler than it looks. Below you will build a customer dimension that records every version of every customer over time, using only SQL.
Prerequisites
- Access to a relational database or Data Warehouse (SQL Server, Azure SQL, PostgreSQL, Snowflake…).
- Basic SQL knowledge:
CREATE TABLE,INSERTandUPDATE. - A source (staging) table with the current snapshot of customers, here called
staging_cliente.
Step 1: Understand what changes in an SCD Type 2
A dimension stores the descriptive attributes of an entity — for example a customer's name and city. Imagine customer number 1 moves from Porto to Lisbon. In an SCD Type 1 you replace 'Porto' with 'Lisbon' and lose the previous information; any report about the past becomes wrong. In an SCD Type 2 you create a new row for the new city and keep the old row intact, marking the period during which each version was valid.

To do this you need three control columns: a start date (valido_de), an end date (valido_ate) and a flag for the current version (atual).
Step 2: Create the dimension table
Besides the customer attributes, the table has a surrogate key that identifies each version, and the business key cliente_id, which repeats across the several versions of the same customer. The surrogate key matters because it is what the fact tables will point to: this way each sale is linked to exactly the version of the customer that existed on the sale date.
CREATE TABLE dim_cliente (
cliente_sk INT IDENTITY(1,1) PRIMARY KEY,
cliente_id INT NOT NULL,
nome VARCHAR(100) NOT NULL,
cidade VARCHAR(100) NOT NULL,
valido_de DATE NOT NULL,
valido_ate DATE NULL,
atual BIT NOT NULL
);
The validity interval is half-open: it goes from valido_de (inclusive) to valido_ate (exclusive). The current version has valido_ate set to NULL and atual = 1.
Step 3: Run the initial load
On the first run, every customer comes in as the current version. Run this step only once, to bootstrap the dimension from staging.
INSERT INTO dim_cliente (cliente_id, nome, cidade, valido_de, valido_ate, atual)
SELECT cliente_id, nome, cidade, CAST(GETDATE() AS DATE), NULL, 1
FROM staging_cliente;
Step 4: Close the versions that changed
From here on, steps 4 and 5 form the recurring load, the one you run on every update. First we close the current version of the customers whose name or city changed: we fill in the end date and set atual to 0.
UPDATE d
SET d.valido_ate = CAST(GETDATE() AS DATE),
d.atual = 0
FROM dim_cliente d
JOIN staging_cliente s ON s.cliente_id = d.cliente_id
WHERE d.atual = 1
AND (s.nome <> d.nome OR s.cidade <> d.cidade);
If any attribute allows NULL values, wrap the comparison with ISNULL (or COALESCE) so that changes to and from NULL are detected too.
Step 5: Insert the new version
Now we insert a new row for two cases: the customers we just closed in the previous step and brand-new customers. Both have one thing in common — right now they have no row with atual = 1. The LEFT JOIN finds exactly those.
INSERT INTO dim_cliente (cliente_id, nome, cidade, valido_de, valido_ate, atual)
SELECT s.cliente_id, s.nome, s.cidade, CAST(GETDATE() AS DATE), NULL, 1
FROM staging_cliente s
LEFT JOIN dim_cliente d
ON d.cliente_id = s.cliente_id AND d.atual = 1
WHERE d.cliente_id IS NULL;
Check the result
Query the history of a customer that changed city. You should see two rows: the old one with atual = 0 and a filled-in end date, and the new one with atual = 1 and a NULL end date.
SELECT cliente_id, nome, cidade, valido_de, valido_ate, atual
FROM dim_cliente
WHERE cliente_id = 1
ORDER BY valido_de;
Also run a sanity check: each customer can only have one current version. The following query should return no rows.
SELECT cliente_id, COUNT(*) AS versoes_atuais
FROM dim_cliente
WHERE atual = 1
GROUP BY cliente_id
HAVING COUNT(*) > 1;
Conclusion
With these five steps you have a dimension that never loses history and that answers questions like 'which city was this customer in when they made the purchase'. The next step is to automate steps 4 and 5 in a scheduled procedure and, when the dimension has many attributes, to compare a hash of the columns instead of field by field, to keep the code shorter. Which of the dimensions in your model would benefit most from keeping history?