(+351) 21 24 10006  ·  info@bconcepts.pt
Carnaxide, Lisbon
Azure Synapse Analytics
Azure Synapse Analytics 1 min

Dedicated SQL Pool in Synapse: data distribution and optimization

João Barros 19 de February de 2025 1 min read

The Dedicated SQL Pool (formerly SQL DW) uses MPP (Massively Parallel Processing) with 60 distribution nodes. How you distribute each table across the 60 nodes directly defines the performance of joins and aggregations.

Distribution types

HASH — distributes rows by hash of a column
  ✓ Ideal for large fact tables
  ✗ Poor if the column has few unique values (data skew)

ROUND_ROBIN — distributes evenly without criteria
  ✓ Good for temporary staging tables
  ✗ Causes data movement in joins

REPLICATE — copies the whole table to each node
  ✓ Ideal for small dimensions (< 2GB)
  ✗ High write and update cost

Create tables with proper distribution

-- Large fact table → HASH on the most frequent join key
CREATE TABLE dbo.FactSales (
    sale_id     BIGINT       NOT NULL,
    customer_id INT          NOT NULL,
    sale_date   DATE         NOT NULL,
    revenue     DECIMAL(18,2)NOT NULL
)
WITH (
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX
);

-- Small dimension → REPLICATE
CREATE TABLE dbo.DimProduct (
    product_id   INT         NOT NULL,
    name         VARCHAR(200)NOT NULL,
    category     VARCHAR(100)
)
WITH (
    DISTRIBUTION = REPLICATE,
    CLUSTERED COLUMNSTORE INDEX
);

Diagnose data skew

-- See row distribution across the 60 nodes
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');

-- If one node has 10x more rows → poor hash column choice
-- Solution: change to another column or use ROUND_ROBIN + pre-grouping

Statistics and result caching

-- Create statistics (essential for the optimizer)
CREATE STATISTICS stats_customer ON dbo.FactSales(customer_id);

-- Enable result set caching (identical queries = zero cost)
ALTER DATABASE SynapseDW SET RESULT_SET_CACHING ON;

Conclusion

In the Dedicated SQL Pool, architecture comes first: distribute facts by hash on the join column, replicate small dimensions and create statistics on all columns used in joins and filters. These three rules solve 80% of performance problems.

Share: