How to read and clean CSV data with pandas in Python

João Barros, 16 May 2023, 2 min read

Working with data often starts with a CSV file. The pandas library for Python is a quick way to read, inspect and clean that file before any analysis.

Prerequisites

Python 3.9 or later installed.
The pandas library: pip install pandas.
A sample CSV file (for example vendas.csv).

Step 1: Read the CSV

Import pandas and load the file into a DataFrame:

import pandas as pd

df = pd.read_csv("vendas.csv")
print(df.head())

The head() method shows the first five rows so you can confirm the data was read correctly.
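
Exported CSV files do not always use commas as separators and dots as decimal marks. As a rough sketch, assuming a semicolon-separated file with decimal commas (an assumption for illustration, not something known about vendas.csv), the same call accepts the sep and decimal arguments. The inline sample and its values are made up:

```python
import io

import pandas as pd

# Hypothetical sample standing in for a file on disk; names and values are invented.
raw = "cliente;preco;data\nAna;10,5;01/02/2023\nRui;7,0;15/03/2023\n"

# sep sets the field separator and decimal sets the decimal mark.
df_sample = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(df_sample.head())
```

If the prices come back as text instead of numbers, a mismatched separator or decimal mark is one likely cause.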

Step 2: Inspect the data

Before cleaning, understand what you have:

df.info()  # info() prints its report itself and returns None
print(df.isnull().sum())

info() prints the column types and the non-null count of each column, and isnull().sum() counts missing values per column.
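
To see what the missing-value count looks like, here is a small sketch on invented data. The three rows are hypothetical; only the column names follow the ones used in this article:

```python
import io

import pandas as pd

# Hypothetical data: one row lacks a customer and one lacks a price.
raw = "cliente,preco,data\nAna,10.5,01/02/2023\n,7.0,15/03/2023\nRui,,20/03/2023\n"
sample = pd.read_csv(io.StringIO(raw))

# Empty fields are read as NaN, so they show up in the per-column count.
missing = sample.isnull().sum()
print(missing)
print(sample.dtypes)
```

Columns with a non-zero count are the ones that need a decision in the cleaning step.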

Step 3: Clean missing values and duplicates

df = df.drop_duplicates()
df["preco"] = df["preco"].fillna(0)
df = df.dropna(subset=["cliente"])

We drop repeated rows, fill missing prices with 0 and discard rows without a customer.
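
The effect of these three operations is easier to follow on a tiny table. This is a minimal sketch with invented rows, not data from vendas.csv:

```python
import pandas as pd

# Hypothetical rows: one exact duplicate, one missing customer, one missing price.
sample = pd.DataFrame({
    "cliente": ["Ana", "Ana", None, "Rui"],
    "preco": [10.5, 10.5, 7.0, None],
})

cleaned = sample.drop_duplicates()                   # removes the repeated Ana row
cleaned = cleaned.dropna(subset=["cliente"]).copy()  # removes the row without a customer
cleaned["preco"] = cleaned["preco"].fillna(0)        # Rui's missing price becomes 0

print(len(sample), "->", len(cleaned))
print(cleaned)
```

Keep in mind that a price of 0 counts in later averages, so filling with 0 only makes sense when a missing price really means "nothing charged".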

Step 4: Fix the data types

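df["data"] = pd.to_datetime(df["data"], format="%d/%m/%Y")
df["preco"] = df["preco"].astype(float)

to_datetime() parses the day/month/year text into real datetime values, and astype(float) makes sure the prices are stored as numbers.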
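
Both conversions raise an error if a value cannot be parsed. One possible variation, sketched here on invented data, is to pass errors="coerce" so that unparseable values become missing instead. Using pd.to_numeric in place of astype(float) is a swapped-in alternative, not the approach used above:

```python
import pandas as pd

# Hypothetical messy values: the last row holds text that is neither a date nor a number.
sample = pd.DataFrame({
    "data": ["01/02/2023", "31/12/2023", "not a date"],
    "preco": ["10.5", "7", "n/a"],
})

# errors="coerce" turns bad dates into NaT and bad numbers into NaN.
sample["data"] = pd.to_datetime(sample["data"], format="%d/%m/%Y", errors="coerce")
sample["preco"] = pd.to_numeric(sample["preco"], errors="coerce")

print(sample.dtypes)
print(sample)
```

The coerced values then appear in isnull().sum(), so they can be reviewed rather than silently kept.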

Verify the result

Run df.info() and df.isnull().sum() again. The cliente and preco columns should have no missing values and the data column should have a datetime64 dtype.
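
The same checks can be written as code instead of read off the screen. The helper below is a hypothetical sketch: check_clean and its parameters are names invented for this example, and the demo table is made up:

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_clean(df, required=("cliente", "preco"), date_col="data"):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for col in required:
        n_missing = int(df[col].isnull().sum())
        if n_missing:
            problems.append(f"{col}: {n_missing} missing value(s)")
    if not is_datetime64_any_dtype(df[date_col]):
        problems.append(f"{date_col}: not a datetime column")
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    return problems


# Hypothetical cleaned table used only to exercise the helper.
demo = pd.DataFrame({
    "cliente": ["Ana", "Rui"],
    "preco": [10.5, 0.0],
    "data": pd.to_datetime(["01/02/2023", "15/03/2023"], format="%d/%m/%Y"),
})
print(check_clean(demo))  # prints []
```

Calling check_clean(df) after Step 4 would give a quick pass or fail list, assuming the file uses the same column names.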

Conclusion

With a handful of pandas lines you turn a raw CSV into a reliable dataset, ready for analysis. What other transformation do you usually need on your files before analysing them?