What is ETL and how to design your first data pipeline

João Barros 11 de July de 2023 2 min read

ETL stands for Extract, Transform, Load — the process that moves data from several sources into an analytical destination. Understanding the three steps is the basis for building any data pipeline.

Prerequisites

Knowledge of SQL and where your data lives (files, databases, APIs).
A destination for the data (a data warehouse or analytical database).
An ETL tool or a language such as Python.

Step 1: Extract

Read the data from the source as it is, without changing it. It can be a file, a table or an API:

SELECT id, cliente, valor, data
FROM origem.vendas
WHERE data >= '2023-01-01';

Step 2: Transform

Clean and adapt the data to the destination rules: fix types, remove duplicates, normalise text, compute columns. This is where most of the work lives.

Step 3: Load

Write the transformed data into the destination, usually incrementally (only what is new):

INSERT INTO dw.fact_vendas (cliente_key, valor, data_key)
SELECT cliente_key, valor, data_key
FROM staging.vendas_limpo;

Step 4: Orchestrate and schedule

A pipeline runs repeatedly. Schedule it (daily, hourly) and log each run so you know whether it succeeded.

Verify the result

Compare the row count in the source and the destination and check a few values by hand. The totals should match.

Conclusion

Extract, Transform, Load is a simple but powerful pattern. Once you master the basic flow, the next step is to make it incremental and monitored. Which data source will you integrate first?