Spark Pools in Synapse: integrated distributed processing

João Barros, 8 December 2025, 1 min read

Apache Spark Pools in Azure Synapse Analytics offer distributed processing integrated with the rest of the Synapse ecosystem — ADLS, SQL Pools, Synapse Link — without needing a separate Databricks workspace.

Create a Spark Pool

// Synapse Studio → Manage → Apache Spark Pools → New
Name: sparkpool-medium
Node size: Medium (8 vCores, 64 GB RAM)
Autoscale: Enabled (min 3, max 10 nodes)
Auto-pause: 15 minutes idle
Apache Spark version: 3.4
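
To see what these autoscale bounds mean in capacity terms, the node counts can be converted into a vCore range. The following is a minimal sketch in plain Python, not a Synapse API: the `VCORES_PER_NODE` table and the `pool_vcore_range` helper are names invented for this illustration, and only the Small, Medium and Large sizes are listed.

```python
# Hypothetical helper (not part of any Synapse SDK): convert autoscale
# node bounds into the total vCore range the pool can occupy.
VCORES_PER_NODE = {"Small": 4, "Medium": 8, "Large": 16}


def pool_vcore_range(node_size, min_nodes, max_nodes):
    """Return (min_vcores, max_vcores) for the given autoscale settings."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    per_node = VCORES_PER_NODE[node_size]
    return min_nodes * per_node, max_nodes * per_node


low, high = pool_vcore_range("Medium", 3, 10)
print(f"sparkpool-medium can use between {low} and {high} vCores")
```

With the settings above, the pool ranges from 24 to 80 vCores. The upper bound is worth checking against the workspace's Spark vCore quota before enabling autoscale.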

Basic PySpark notebook

%%pyspark
# Read from ADLS (interactive sessions use the signed-in user's identity; pipeline runs use the workspace Managed Identity)
df = spark.read.format("parquet").load("abfss://silver@stadatalake.dfs.core.windows.net/sales/")

# Transform
from pyspark.sql.functions import col, year, sum as _sum

df_gold = (df
    .filter(col("status") == "Complete")
    .groupBy(year("sale_date").alias("year"), "country")
    .agg(_sum("revenue").alias("total_revenue"))
    .orderBy("year", col("total_revenue").desc()))

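# Save as Delta in Gold (the target database must exist before saveAsTable)
spark.sql("CREATE DATABASE IF NOT EXISTS gold")
df_gold.write.format("delta").mode("overwrite").saveAsTable("gold.annual_sales_country")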
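
To make the shape of the result concrete, here is a minimal sketch that reproduces the same filter, group, sum and sort logic in plain Python on a handful of rows. The sample rows and the `annual_sales_by_country` function are invented for illustration; they only assume the column names used in the notebook above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows standing in for the silver sales data.
rows = [
    {"sale_date": date(2023, 3, 1), "country": "PT", "status": "Complete", "revenue": 100.0},
    {"sale_date": date(2023, 7, 9), "country": "PT", "status": "Complete", "revenue": 50.0},
    {"sale_date": date(2023, 5, 2), "country": "BR", "status": "Complete", "revenue": 300.0},
    {"sale_date": date(2023, 6, 4), "country": "BR", "status": "Cancelled", "revenue": 999.0},
    {"sale_date": date(2024, 1, 15), "country": "PT", "status": "Complete", "revenue": 70.0},
]


def annual_sales_by_country(rows):
    """Keep completed sales, sum revenue per (year, country), then sort."""
    totals = defaultdict(float)
    for r in rows:
        if r["status"] == "Complete":
            totals[(r["sale_date"].year, r["country"])] += r["revenue"]
    result = [
        {"year": y, "country": c, "total_revenue": t}
        for (y, c), t in totals.items()
    ]
    # Same ordering as the notebook: year ascending, revenue descending.
    result.sort(key=lambda r: (r["year"], -r["total_revenue"]))
    return result


for row in annual_sales_by_country(rows):
    print(row)
```

The cancelled sale is dropped, and within each year the country with the highest revenue comes first.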

Read/write to a Dedicated SQL Pool

%%pyspark
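# The Dedicated SQL Pool connector must be imported before synapsesql is available in PySpark
import com.microsoft.spark.sqlanalytics

# Read from the Dedicated SQL Pool (via the built-in Synapse connector)
df_dim = spark.read.synapsesql("SynapseDW.dbo.DimProduct")

# Write to the Dedicated SQL Pool (the save mode is set on the writer, not passed to synapsesql)
df_gold.write.mode("overwrite").synapsesql("SynapseDW.gold.FactAnnualSales")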
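
The connector addresses tables by a three-part name in the form database.schema.table. A small check before calling it can catch a malformed name early. This is a minimal sketch: `split_three_part_name` is a hypothetical helper, not part of the connector, and it does not handle bracket-quoted identifiers that contain dots.

```python
def split_three_part_name(name):
    """Split '<database>.<schema>.<table>' and fail fast on anything else.

    Hypothetical helper for illustration; bracket-quoted identifiers that
    contain dots are deliberately not supported.
    """
    parts = name.split(".")
    if len(parts) != 3 or not all(p.strip() for p in parts):
        raise ValueError(f"expected <database>.<schema>.<table>, got {name!r}")
    database, schema, table = parts
    return database, schema, table


database, schema, table = split_three_part_name("SynapseDW.dbo.DimProduct")
print(database, schema, table)
```

Failing on a two-part name such as `dbo.DimProduct` in the notebook is cheaper than waiting for the Spark job to start and the connector to reject it.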

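Pipeline parameters in the notebook

%%pyspark
# Parameters cell: in Synapse Studio, select "Toggle parameter cell" on this cell.
# A pipeline Notebook activity that passes a parameter named ref_date
# overrides this default at run time.
ref_date = "2024-01-01"

%%pyspark
# Later cells see the value supplied by the pipeline, or the default in interactive runs
print(f"Processing from: {ref_date}")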
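
Since `ref_date` is a plain string, it is worth validating it before it is used in a filter. The sketch below is one way to do that with the standard library; `parse_ref_date` and `DEFAULT_REF_DATE` are hypothetical names, and the ISO `YYYY-MM-DD` format is an assumption based on the default value shown above.

```python
from datetime import date

# Assumed default, taken from the fallback value used in the notebook.
DEFAULT_REF_DATE = "2024-01-01"


def parse_ref_date(raw, default=DEFAULT_REF_DATE):
    """Return a datetime.date from an ISO string, using the default when
    the parameter is missing or blank. Raises ValueError on a bad format."""
    value = (raw or "").strip() or default
    return date.fromisoformat(value)


print(parse_ref_date(None))          # falls back to the default
print(parse_ref_date("2025-12-08"))  # value as a pipeline might pass it
```

The resulting `date` object can then be compared against a date column, for example `col("sale_date") >= parse_ref_date(ref_date)`, so that a mistyped parameter fails at the top of the notebook and not deep inside the job.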

Conclusion

Spark Pools in Synapse are the natural choice for teams that already use Synapse Analytics and want Spark processing without managing a separate Databricks workspace. Because ADLS access, the Dedicated SQL Pool connector and pipeline orchestration all live in the same workspace, there is one less platform to provision, secure and operate.