(+351) 21 24 10006  ·  info@bconcepts.pt
Carnaxide, Lisbon
ETL

How to build an ETL pipeline in Python with pandas

João Barros 04 de July de 2026 4 min read

Moving data from a raw file into a database that's ready to query is one of the most common tasks in any data project. An ETL (Extract, Transform, Load) pipeline does exactly that, in an organized and repeatable way. Here you'll build, from start to finish, a simple ETL pipeline in Python that reads a sales CSV, cleans and enriches the data, and loads the result into a SQLite database.

Prerequisites

  • Python 3.9 or higher installed.
  • The pandas library (install it with pip install pandas).
  • Basic Python knowledge (variables and functions).
  • A code editor or a terminal. SQLite already ships with Python through the sqlite3 module.

Step 1: Prepare the environment and the data

Start by installing pandas and creating a sample file. This short script generates a vendas.csv with a few rows, including one missing value on purpose, so we can see how to handle it later.

How to build an ETL pipeline in Python with pandas
pip install pandas
import pandas as pd

dados = {
    "data": ["2026-06-01", "2026-06-01", "2026-06-02", None],
    "produto": ["Teclado", "Rato", "Monitor", "Teclado"],
    "quantidade": [3, 5, 2, 1],
    "preco_unitario": [25.0, 12.5, 150.0, 25.0],
}
pd.DataFrame(dados).to_csv("vendas.csv", index=False)

Step 2: Extract — read the source data

The extract stage reads data from the source. With pandas, reading a CSV is a single line. Print the first rows to confirm the data arrived as you expected before transforming it.

import pandas as pd

df = pd.read_csv("vendas.csv")
print(df.head())

Step 3: Transform — clean and enrich

The transform stage is the heart of the pipeline. Here we remove invalid rows, convert the columns to the right data types, and create a new column with the total value of each sale. This is also where, in a real project, you would apply your business rules.

# Remover linhas sem data
df = df.dropna(subset=["data"])

# Converter para os tipos corretos
df["data"] = pd.to_datetime(df["data"])
df["quantidade"] = df["quantidade"].astype(int)

# Criar uma coluna derivada
df["total"] = df["quantidade"] * df["preco_unitario"]

Step 4: Load — write to the database

The load stage writes the cleaned data to the destination. The pandas to_sql method creates the table and inserts the rows into a SQLite database. The if_exists="replace" parameter recreates the table on each run, which is handy while you're testing.

import sqlite3

conn = sqlite3.connect("vendas.db")
df.to_sql("vendas", conn, if_exists="replace", index=False)
conn.commit()
conn.close()

Step 5: Put it all together in one pipeline

With the three stages ready, the final step is to wrap them in a reusable function. That way you can run the whole process with a single call and reuse it for other files.

def etl(origem_csv, destino_db):
    df = pd.read_csv(origem_csv)
    df = df.dropna(subset=["data"])
    df["data"] = pd.to_datetime(df["data"])
    df["total"] = df["quantidade"] * df["preco_unitario"]

    conn = sqlite3.connect(destino_db)
    df.to_sql("vendas", conn, if_exists="replace", index=False)
    conn.close()
    return len(df)

linhas = etl("vendas.csv", "vendas.db")
print(f"Carregadas {linhas} linhas")

Check the result

To confirm everything worked, connect to the database again and query the table. You should see three rows (the row with no date was removed) and the new total column filled in.

import sqlite3
import pandas as pd

conn = sqlite3.connect("vendas.db")
print(pd.read_sql("SELECT * FROM vendas", conn))
conn.close()

If the query returns the rows with the data, produto, quantidade, preco_unitario and total columns, your ETL pipeline is working.

Conclusion

You've built a complete ETL pipeline: you extracted data from a CSV, transformed it with pandas, and loaded it into a query-ready SQLite database. From here, the natural next steps are to schedule the script (with cron on Linux or Task Scheduler on Windows), add incremental loads to process only new rows, and, as the volume grows, move to an orchestration tool such as Azure Data Factory. Which data source will you transform next in your first pipeline?

Share: