How to build an ETL pipeline in Python with pandas
Moving data from a raw file into a database that's ready to query is one of the most common tasks in any data project. An ETL (Extract, Transform, Load) pipeline does exactly that, in an organized and repeatable way. Here you'll build, from start to finish, a simple ETL pipeline in Python that reads a sales CSV, cleans and enriches the data, and loads the result into a SQLite database.
Prerequisites
- Python 3.9 or higher installed.
- The
pandaslibrary (install it withpip install pandas). - Basic Python knowledge (variables and functions).
- A code editor or a terminal. SQLite already ships with Python through the
sqlite3module.
Step 1: Prepare the environment and the data
Start by installing pandas and creating a sample file. This short script generates a vendas.csv with a few rows, including one missing value on purpose, so we can see how to handle it later.

pip install pandas
import pandas as pd
dados = {
"data": ["2026-06-01", "2026-06-01", "2026-06-02", None],
"produto": ["Teclado", "Rato", "Monitor", "Teclado"],
"quantidade": [3, 5, 2, 1],
"preco_unitario": [25.0, 12.5, 150.0, 25.0],
}
pd.DataFrame(dados).to_csv("vendas.csv", index=False)
Step 2: Extract — read the source data
The extract stage reads data from the source. With pandas, reading a CSV is a single line. Print the first rows to confirm the data arrived as you expected before transforming it.
import pandas as pd
df = pd.read_csv("vendas.csv")
print(df.head())
Step 3: Transform — clean and enrich
The transform stage is the heart of the pipeline. Here we remove invalid rows, convert the columns to the right data types, and create a new column with the total value of each sale. This is also where, in a real project, you would apply your business rules.
# Remover linhas sem data
df = df.dropna(subset=["data"])
# Converter para os tipos corretos
df["data"] = pd.to_datetime(df["data"])
df["quantidade"] = df["quantidade"].astype(int)
# Criar uma coluna derivada
df["total"] = df["quantidade"] * df["preco_unitario"]
Step 4: Load — write to the database
The load stage writes the cleaned data to the destination. The pandas to_sql method creates the table and inserts the rows into a SQLite database. The if_exists="replace" parameter recreates the table on each run, which is handy while you're testing.
import sqlite3
conn = sqlite3.connect("vendas.db")
df.to_sql("vendas", conn, if_exists="replace", index=False)
conn.commit()
conn.close()
Step 5: Put it all together in one pipeline
With the three stages ready, the final step is to wrap them in a reusable function. That way you can run the whole process with a single call and reuse it for other files.
def etl(origem_csv, destino_db):
df = pd.read_csv(origem_csv)
df = df.dropna(subset=["data"])
df["data"] = pd.to_datetime(df["data"])
df["total"] = df["quantidade"] * df["preco_unitario"]
conn = sqlite3.connect(destino_db)
df.to_sql("vendas", conn, if_exists="replace", index=False)
conn.close()
return len(df)
linhas = etl("vendas.csv", "vendas.db")
print(f"Carregadas {linhas} linhas")
Check the result
To confirm everything worked, connect to the database again and query the table. You should see three rows (the row with no date was removed) and the new total column filled in.
import sqlite3
import pandas as pd
conn = sqlite3.connect("vendas.db")
print(pd.read_sql("SELECT * FROM vendas", conn))
conn.close()
If the query returns the rows with the data, produto, quantidade, preco_unitario and total columns, your ETL pipeline is working.
Conclusion
You've built a complete ETL pipeline: you extracted data from a CSV, transformed it with pandas, and loaded it into a query-ready SQLite database. From here, the natural next steps are to schedule the script (with cron on Linux or Task Scheduler on Windows), add incremental loads to process only new rows, and, as the volume grows, move to an orchestration tool such as Azure Data Factory. Which data source will you transform next in your first pipeline?