Dedicated SQL Pool in Synapse: data distribution and optimization
João Barros
19 de February de 2025
1 min read
The Dedicated SQL Pool (formerly SQL DW) uses MPP (Massively Parallel Processing) with 60 distribution nodes. How you distribute each table across the 60 nodes directly defines the performance of joins and aggregations.
Distribution types
HASH — distributes rows by hash of a column
✓ Ideal for large fact tables
✗ Poor if the column has few unique values (data skew)
ROUND_ROBIN — distributes evenly without criteria
✓ Good for temporary staging tables
✗ Causes data movement in joins
REPLICATE — copies the whole table to each node
✓ Ideal for small dimensions (< 2GB)
✗ High write and update cost
Create tables with proper distribution
-- Large fact table → HASH on the most frequent join key
CREATE TABLE dbo.FactSales (
sale_id BIGINT NOT NULL,
customer_id INT NOT NULL,
sale_date DATE NOT NULL,
revenue DECIMAL(18,2)NOT NULL
)
WITH (
DISTRIBUTION = HASH(customer_id),
CLUSTERED COLUMNSTORE INDEX
);
-- Small dimension → REPLICATE
CREATE TABLE dbo.DimProduct (
product_id INT NOT NULL,
name VARCHAR(200)NOT NULL,
category VARCHAR(100)
)
WITH (
DISTRIBUTION = REPLICATE,
CLUSTERED COLUMNSTORE INDEX
);
Diagnose data skew
-- See row distribution across the 60 nodes
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
-- If one node has 10x more rows → poor hash column choice
-- Solution: change to another column or use ROUND_ROBIN + pre-grouping
Statistics and result caching
-- Create statistics (essential for the optimizer)
CREATE STATISTICS stats_customer ON dbo.FactSales(customer_id);
-- Enable result set caching (identical queries = zero cost)
ALTER DATABASE SynapseDW SET RESULT_SET_CACHING ON;
Conclusion
In the Dedicated SQL Pool, architecture comes first: distribute facts by hash on the join column, replicate small dimensions and create statistics on all columns used in joins and filters. These three rules solve 80% of performance problems.