Getting started with PySpark: read and transform data
João Barros
13 February 2024
Apache Spark processes large volumes of data in parallel, and PySpark is its Python interface. This tutorial covers the first steps: creating a session, reading data, applying simple transformations and saving the result.
Prerequisites
- Python 3.9+ and PySpark installed (pip install pyspark), or an environment such as Databricks.
- Basic knowledge of Python.
- A data file (CSV or Parquet).
Step 1: Create the SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("primeiros-passos").getOrCreate()
Step 2: Read the data
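# inferSchema asks Spark to detect column types; without it every CSV column is read as a string
df = spark.read.option("header", True).option("inferSchema", True).csv("vendas.csv")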
df.show(5)
show() displays the first rows so you can confirm the read.

Step 3: Transform with the DataFrame API
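from pyspark.sql import functions as F

# Keep positive sales, then total them per category.
# The alias gives the result a plain name instead of sum(valor).
vendas = (
    df.filter(F.col("valor") > 0)
    .groupBy("categoria")
    .agg(F.sum("valor").alias("total_valor"))
)
vendas.show()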
Step 4: Save the result
vendas.write.mode("overwrite").parquet("saida/vendas_por_categoria")
Verify the result
Read the saved output back with spark.read.parquet(...) and confirm that the aggregation by category is correct. Note that Spark writes a directory of part files rather than a single file.
Conclusion
With a SparkSession, a read and a few transformations you are already processing data at scale with PySpark. Which large dataset would you like to process next?