How to validate data quality in an ETL pipeline

João Barros 04 de July de 2026 4 min read

An ETL pipeline that loads bad data is worse than having no pipeline at all: the errors spread silently into your reports and decisions. Empty emails, impossible ages or repeated customers end up in the final table without anyone noticing. Data quality validation solves this by checking each row before the load, letting through only the ones that meet your rules. It is a simple step to add with Python and pandas, and it makes a big difference in how much you can trust your data.

Prerequisites

Python 3.9 or later installed on your computer.
The pandas library installed (pip install pandas).
A sample CSV file to validate (we use clientes.csv).
A basic understanding of the ETL flow: extract, transform and load.

Step 1: Extract the data and define the rules

The first step is to read the source data into a DataFrame and write out, explicitly, the rules that each row must follow. Keeping the rules together in one place makes the pipeline easy to maintain and to audit: if the age rule changes tomorrow, you edit a single number. In our example we have a customers table and four quality rules: the email cannot be empty, the age must be between 18 and 120, the country must belong to an allowed list, and the id must be unique.

import pandas as pd

# Extrair (E): ler os dados de origem
df = pd.read_csv("clientes.csv")

# Lista de valores permitidos para o pais
PAISES_VALIDOS = ["PT", "ES", "FR", "DE"]

Step 2: Write the validation function

Instead of deleting invalid rows right away, we note the reason for each failure. This keeps a clear record of what went wrong, which helps fix the source later. The function below takes a row and returns a list of every problem found in it — note that we collect all the errors, not just the first one. That way, whoever fixes the data sees every problem at once instead of discovering them one by one.

def validar_linha(linha):
    problemas = []
    if pd.isna(linha["email"]) or str(linha["email"]).strip() == "":
        problemas.append("email vazio")
    if pd.isna(linha["idade"]) or linha["idade"] < 18 or linha["idade"] > 120:
        problemas.append("idade invalida")
    if linha["pais"] not in PAISES_VALIDOS:
        problemas.append("pais nao permitido")
    return problemas

Step 3: Apply the rules and split the rows

Now we apply the function to every row with apply and axis=1, then add the duplicate-id check. Unlike the other rules, detecting duplicates means comparing each id against all the others, so it runs over the whole column rather than row by row. The result is an erros column holding the problem text for each row. Rows with no errors are the valid ones; the rest go to quarantine.

# Erros por linha (regras individuais)
df["erros"] = df.apply(validar_linha, axis=1)

# Regra que envolve a coluna toda: id duplicado
duplicados = df["id"].duplicated(keep=False)
df.loc[duplicados, "erros"] = df.loc[duplicados, "erros"].apply(
    lambda lista: lista + ["id duplicado"]
)

# Transformar a lista de erros em texto
df["erros"] = df["erros"].apply(lambda lista: ", ".join(lista))

# Separar validas de invalidas
validas = df[df["erros"] == ""].copy()
invalidas = df[df["erros"] != ""].copy()

Step 4: Load the valid rows and store the invalid ones

In the load (L) phase we send only the valid rows to the destination. The invalid ones are not thrown away: they are written to a quarantine file with the erros column, so someone can review and fix them. Here we use CSV files to keep the example simple, but the destination could be a database or a data warehouse.

# Carregar (L): so as linhas validas
validas.drop(columns=["erros"]).to_csv("clientes_limpos.csv", index=False)

# Quarentena: linhas invalidas com o motivo
invalidas.to_csv("clientes_quarentena.csv", index=False)

print(f"Validas: {len(validas)} | Invalidas: {len(invalidas)}")

Verify the result

To confirm the validation worked, open the clientes_quarentena.csv file and check that each rejected row has an explanation in the erros column. The number of valid rows plus the number of invalid rows should equal the total read at the start — the assert line confirms this automatically and fails if the numbers do not add up.

# Verificacao rapida: as contas fecham?
assert len(validas) + len(invalidas) == len(df)
print(invalidas[["id", "erros"]])

Conclusion

With just a few lines of code you now have an ETL pipeline that trusts but verifies: good rows move on to the destination and doubtful ones stay in quarantine with the reason attached. From here you can add new rules (email format, future dates, negative values), turn this check into a reusable step, or log quality metrics on every run. Which quality rule matters most for your data — and are you already validating it?