Optimized PySpark: partitioning, caching and tuning for performance
João Barros
May 27, 2025
2 min read
Apache Spark is powerful, but it needs careful configuration and design to run efficiently at scale. A small partitioning mistake or a poorly planned join can increase execution time tenfold.
Partitioning
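The number of partitions sets the upper bound on parallelism, because each partition is processed by one task. Too few partitions leave cores idle and produce oversized tasks; too many add scheduling overhead and make shuffles more expensive.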
# See current number of partitions
print(df.rdd.getNumPartitions())
# Repartition (full shuffle — expensive)
df = df.repartition(200, "country")
# Coalesce (merges existing partitions without a full shuffle; can only reduce the count)
df = df.coalesce(50)
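One way to pick the number passed to repartition is to work backward from the data size. The sketch below is a rule of thumb, not something Spark prescribes: it assumes a target of roughly 128 MiB per partition and an optional floor such as the number of available cores. The helper name suggest_partitions and both defaults are illustrative assumptions.

```python
import math

MIB = 1024 * 1024


def suggest_partitions(total_bytes: int,
                       target_partition_bytes: int = 128 * MIB,
                       min_partitions: int = 1) -> int:
    """Estimate a partition count from the data size (illustrative heuristic)."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and target_partition_bytes > 0")
    by_size = math.ceil(total_bytes / target_partition_bytes)
    # Never go below the floor, so every core can receive at least one task.
    return max(by_size, min_partitions)


# 10 GiB of input at about 128 MiB per partition
n = suggest_partitions(10 * 1024 * MIB)
print(n)  # 80
# Hypothetical usage with a DataFrame: df = df.repartition(n, "country")
```

Treat the result as a starting point and adjust it after checking task durations in the Spark UI.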
Strategic caching
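from pyspark import StorageLevel
from pyspark.sql.functions import col
# cache() uses the default storage level: memory first, spilling to disk when the data does not fit
df_filtered = df.filter(col("year") == 2024).cache()
df_filtered.count() # running an action materializes the cache
# Release when no longer needed
df_filtered.unpersist()
# Alternative to cache(): persist with an explicit storage level
# (MEMORY_AND_DISK_SER is not available in PySpark 3.x; use MEMORY_AND_DISK)
df_filtered = df.filter(col("year") == 2024).persist(StorageLevel.MEMORY_AND_DISK)
df_filtered.count()
df_filtered.unpersist()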
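Forgetting to release a cached DataFrame is a common way to waste executor memory. One possible pattern, sketched here as an assumption and not as a built-in PySpark feature, is a small context manager that persists on entry and always unpersists on exit. The name cached is hypothetical; it relies only on the persist(), count() and unpersist() methods that DataFrames expose.

```python
from contextlib import contextmanager


@contextmanager
def cached(df, storage_level=None):
    """Persist df for the duration of a with-block, then release it.

    Works with any object exposing persist(), count() and unpersist(),
    which PySpark DataFrames do.
    """
    persisted = df.persist() if storage_level is None else df.persist(storage_level)
    try:
        persisted.count()  # run an action so the cache is actually filled
        yield persisted
    finally:
        # Runs even if the body raises, so the cache is never leaked.
        persisted.unpersist()


# Hypothetical usage, assuming df and col are defined as in the snippet above:
# with cached(df.filter(col("year") == 2024)) as df_filtered:
#     df_filtered.groupBy("country").count().show()
#     df_filtered.agg({"amount": "sum"}).show()
```

Caching only pays off when the same DataFrame feeds more than one action, as in the two queries of the usage comment.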
Broadcast join for small tables
from pyspark.sql.functions import broadcast
# Force broadcast of the small dimension — avoids shuffle on the big table
df_result = df_facts.join(broadcast(df_dim_product), "product_id")
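To see why broadcasting avoids a shuffle, it can help to model the join outside Spark. The pure-Python sketch below is a conceptual illustration, not Spark's implementation: the small table becomes one lookup dictionary that every partition of the large table receives, so each partition is joined where it already lives. The function name and the sample rows are made up for the example.

```python
from typing import Dict, Iterable, List


def broadcast_hash_join(big_partitions: List[List[dict]],
                        small_rows: Iterable[dict],
                        key: str) -> List[List[dict]]:
    """Inner-join each partition of the big table against a shared lookup."""
    # Build the hash table once from the small side (the "broadcast" copy).
    # Assumes the key is unique in the small table, as in a dimension table.
    lookup: Dict[object, dict] = {row[key]: row for row in small_rows}

    joined = []
    for partition in big_partitions:
        # Each partition is joined in place: big-table rows are never
        # regrouped by key, which is the work a shuffle would do.
        joined.append([
            {**row, **lookup[row[key]]}
            for row in partition
            if row[key] in lookup
        ])
    return joined


facts = [
    [{"product_id": 1, "amount": 10}, {"product_id": 2, "amount": 5}],
    [{"product_id": 1, "amount": 7}, {"product_id": 3, "amount": 1}],
]
dim_product = [{"product_id": 1, "name": "pen"}, {"product_id": 2, "name": "ink"}]

result = broadcast_hash_join(facts, dim_product, "product_id")
print([len(p) for p in result])  # [2, 1]
```

In Spark itself, tables whose estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default) are broadcast automatically; the explicit broadcast() hint is useful when the optimizer cannot estimate the size reliably.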
Avoid unnecessary shuffles
from pyspark.sql import functions as F
# Costly: grouping by high-cardinality columns when that granularity is not needed
df.groupBy("customer_id", "product_id").agg(F.sum("amount"))
# Better: pre-aggregate to the join key so fewer rows reach the join
df_agg = df.groupBy("product_id").agg(F.sum("amount").alias("total"))
df_agg.join(df_dim_product, "product_id")
Essential settings
spark.conf.set("spark.sql.shuffle.partitions", "200") # default 200
spark.conf.set("spark.sql.adaptive.enabled", "true") # AQE: re-optimizes the plan at runtime (on by default since Spark 3.2)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") # lets AQE merge small shuffle partitions
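The same settings can also be supplied when the job is launched instead of inside the code. As a small sketch, the snippet below keeps them in one dictionary and renders them as --conf key=value arguments for spark-submit. The dictionary name, the helper name and the script name job.py are illustrative assumptions.

```python
from typing import Dict, List

# Values mirror the settings above; tune them for your own workload.
SPARK_SETTINGS: Dict[str, str] = {
    "spark.sql.shuffle.partitions": "200",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}


def to_submit_args(settings: Dict[str, str]) -> List[str]:
    """Render settings as spark-submit arguments: --conf key=value."""
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# job.py is a placeholder script name.
command = ["spark-submit", *to_submit_args(SPARK_SETTINGS), "job.py"]
print(" ".join(command))
```

Keeping the values in one place makes it easier to compare runs with different settings without editing the job itself.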
Conclusion
Spark performance comes from combining good design (sensible partitioning, avoiding unnecessary shuffles, broadcasting small dimensions) with proper configuration (AQE, shuffle partitions). Before optimizing, use the Spark UI to identify slow stages and data skew.