Spark Pools in Synapse: integrated distributed processing
João Barros
08 December 2025
Apache Spark Pools in Azure Synapse Analytics offer distributed processing integrated with the rest of the Synapse ecosystem (ADLS, SQL Pools, Synapse Link), with no need for a separate Databricks workspace.
Create a Spark Pool
// Synapse Studio → Manage → Apache Spark Pools → New
Name: sparkpool-medium
Node size: Medium (8 vCores, 64 GB RAM)
Autoscale: Enabled (min 3, max 10 nodes)
Auto-pause: 15 minutes idle
Apache Spark version: 3.4
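To get a feel for what the autoscale range above means in capacity terms, the arithmetic can be sketched in plain Python. The function name `pool_vcore_range` is hypothetical and is not part of any Synapse API; the per-node vCore count is taken from the Medium node size listed above.

```python
def pool_vcore_range(vcores_per_node: int, min_nodes: int, max_nodes: int) -> tuple[int, int]:
    """Return the (minimum, maximum) total vCores of an autoscaling pool."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    return vcores_per_node * min_nodes, vcores_per_node * max_nodes


# Medium nodes (8 vCores each), autoscaling between 3 and 10 nodes
low, high = pool_vcore_range(vcores_per_node=8, min_nodes=3, max_nodes=10)
print(f"Pool capacity: {low} to {high} vCores")
```

With the settings above, the pool ranges from 24 to 80 vCores, a figure worth comparing against the workspace's Spark vCore quota before raising the maximum node count.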
Basic PySpark notebook
%%pyspark
# Read from ADLS (pipeline runs authenticate with the workspace Managed Identity;
# interactive sessions use the signed-in user's identity)
df = spark.read.format("parquet").load("abfss://silver@stadatalake.dfs.core.windows.net/sales/")
# Transform
from pyspark.sql.functions import col, year, sum as _sum
df_gold = (df
    .filter(col("status") == "Complete")
    .groupBy(year("sale_date").alias("year"), "country")
    .agg(_sum("revenue").alias("total_revenue"))
    .orderBy("year", col("total_revenue").desc()))
# Save as Delta in Gold
df_gold.write.format("delta").mode("overwrite").saveAsTable("gold.annual_sales_country")
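To make the transformation easier to reason about, here is a plain-Python illustration of the same logic (filter, group by year and country, sum, sort) on a handful of rows. It does not use Spark, and the sample rows are hypothetical values invented for the example; only the column names come from the notebook above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows standing in for the silver sales data
rows = [
    {"sale_date": date(2024, 3, 1), "country": "PT", "status": "Complete", "revenue": 100.0},
    {"sale_date": date(2024, 7, 9), "country": "PT", "status": "Complete", "revenue": 50.0},
    {"sale_date": date(2024, 5, 2), "country": "BR", "status": "Cancelled", "revenue": 999.0},
    {"sale_date": date(2024, 8, 20), "country": "BR", "status": "Complete", "revenue": 400.0},
    {"sale_date": date(2023, 1, 15), "country": "PT", "status": "Complete", "revenue": 70.0},
]

# Equivalent of filter(status == "Complete") + groupBy(year, country) + sum(revenue)
totals = defaultdict(float)
for row in rows:
    if row["status"] == "Complete":
        totals[(row["sale_date"].year, row["country"])] += row["revenue"]

# Equivalent of orderBy(year ascending, total_revenue descending)
result = sorted(
    ((year, country, total) for (year, country), total in totals.items()),
    key=lambda r: (r[0], -r[2]),
)

for year, country, total in result:
    print(year, country, total)
```

The cancelled sale is dropped before aggregation, and within each year the countries appear from highest to lowest total revenue, which is the shape the Gold table ends up with.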
Read/write to a Dedicated SQL Pool
%%pyspark
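# The connector's synapsesql methods need these imports in PySpark
import com.microsoft.spark.sqlanalytics
from com.microsoft.spark.sqlanalytics.Constants import Constants
# Read from the Dedicated SQL Pool (via the Synapse Dedicated SQL Pool connector)
df_dim = spark.read.synapsesql("SynapseDW.dbo.DimProduct")
# Write to the Dedicated SQL Pool as an internal table; the save mode is set on the writer
df_gold.write.mode("overwrite").synapsesql("SynapseDW.gold.FactAnnualSales", Constants.INTERNAL)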
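Both calls above identify the table with a three-part name in the form database.schema.table. A small guard like the one below can catch a malformed name before a Spark job is submitted. This is a minimal sketch: `parse_three_part_name` and `TableName` are hypothetical helpers, not part of the connector, and the sketch does not handle bracket-quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    database: str
    schema: str
    table: str


def parse_three_part_name(name: str) -> TableName:
    """Split 'database.schema.table' into its parts, rejecting other shapes."""
    parts = name.split(".")
    if len(parts) != 3 or not all(part.strip() for part in parts):
        raise ValueError(f"Expected 'database.schema.table', got {name!r}")
    return TableName(*parts)


target = parse_three_part_name("SynapseDW.gold.FactAnnualSales")
print(target.database, target.schema, target.table)
```

Calling the helper with a two-part name such as "dbo.DimProduct" raises a `ValueError`, so the mistake surfaces in the notebook cell rather than deep inside the write.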
Pipeline variables in the notebook
%%pyspark
# Parameters cell: mark it with "Toggle parameter cell" in Synapse Studio.
# A pipeline Notebook activity parameter named ref_date overrides this default.
ref_date = "2024-01-01"
print(f"Processing from: {ref_date}")
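Because the parameter arrives as a string, it can help to validate it before it is used in filters or paths. The following is a minimal sketch, assuming the date is expected in YYYY-MM-DD form; `parse_ref_date` and `DEFAULT_REF_DATE` are hypothetical names introduced for illustration.

```python
from datetime import date, datetime
from typing import Optional

DEFAULT_REF_DATE = "2024-01-01"  # same fallback value as in the cell above


def parse_ref_date(value: Optional[str], default: str = DEFAULT_REF_DATE) -> date:
    """Return the parameter as a date, using the default when it is None or empty."""
    text = (value or default).strip()
    # strptime raises ValueError for anything that is not YYYY-MM-DD
    return datetime.strptime(text, "%Y-%m-%d").date()


print(parse_ref_date("2025-06-30"))
print(parse_ref_date(None))
```

Failing fast on a value like "30/06/2025" keeps a mistyped pipeline parameter from silently producing an empty or wrong result set.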
Conclusion
Spark Pools in Synapse are the natural choice for teams that already use Synapse Analytics and want Spark processing without managing a separate Databricks workspace. Native integration with ADLS, SQL Pools and Synapse pipelines significantly simplifies the architecture.