Getting started with PySpark: read and transform data

João Barros, February 13, 2024, 1 min read

Apache Spark processes large volumes of data in parallel, and PySpark is its Python interface. This tutorial covers the first steps: create a session, read data, apply simple transformations, and save the result.

Prerequisites

Python 3.9+ with PySpark installed (pip install pyspark; a local install also needs a Java runtime), or a managed environment such as Databricks.
Basic knowledge of Python.
A data file (CSV or Parquet).
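If you have no data file at hand, the sketch below writes a small vendas.csv using only the Python standard library. The column names categoria and valor are the ones used in Step 3. The rows themselves and the helper name write_sample_csv are hypothetical, made up for illustration.

```python
import csv
from pathlib import Path

# Hypothetical sample rows; only the column names come from the tutorial.
SAMPLE_ROWS = [
    {"categoria": "livros", "valor": "120.50"},
    {"categoria": "livros", "valor": "30.00"},
    {"categoria": "jogos", "valor": "75.25"},
    {"categoria": "jogos", "valor": "-10.00"},  # negative amount, dropped by the filter in Step 3
    {"categoria": "musica", "valor": "15.00"},
]


def write_sample_csv(path="vendas.csv"):
    """Write the sample rows to a CSV file with a header line and return its path."""
    target = Path(path)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["categoria", "valor"])
        writer.writeheader()
        writer.writerows(SAMPLE_ROWS)
    return target


if __name__ == "__main__":
    print(write_sample_csv())
```

Run it once in the directory where you start PySpark so that the relative path vendas.csv resolves.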

Step 1: Create the SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("primeiros-passos").getOrCreate()

Step 2: Read the data

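# inferSchema types "valor" as a number; by default every CSV column is a string
df = spark.read.option("header", True).option("inferSchema", True).csv("vendas.csv")
df.show(5)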

show(5) prints the first five rows so you can confirm that the file and its header were read correctly.

Step 3: Transform with the DataFrame API

from pyspark.sql.functions import col

vendas = (df
    .filter(col("valor") > 0)   # keep only positive amounts
    .groupBy("categoria")       # one group per category
    .sum("valor"))              # result column is named "sum(valor)"
vendas.show()

Step 4: Save the result

vendas.write.mode("overwrite").parquet("saida/vendas_por_categoria")

Verify the result

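Read the saved output back with spark.read.parquet(...), pointing at the directory written in Step 4 (Spark writes a directory of part files, not a single file), and confirm that the total for each category matches what vendas.show() printed.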

Conclusion

With a SparkSession, a read, a few transformations and a write, you have a complete pipeline that PySpark can run unchanged on much larger datasets. Which large dataset would you like to process next?