Serverless SQL in Synapse: query the Data Lake without infrastructure

João Barros 26 de September de 2024 1 min read

The Serverless SQL Pool in Azure Synapse Analytics is a query engine that lets you query files on Azure Data Lake Storage with standard T-SQL, without configuring infrastructure. You only pay for the queries executed (TB processed).

Query Parquet directly

-- Read a Parquet file from ADLS
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://stadatalake.dfs.core.windows.net/silver/sales/**',
    FORMAT = 'PARQUET'
) AS [result];

-- With an explicit schema for better performance
SELECT *
FROM OPENROWSET(
    BULK 'https://stadatalake.dfs.core.windows.net/silver/sales/',
    FORMAT = 'PARQUET'
)
WITH (
    sale_date   DATE,
    customer_id INT,
    revenue     DECIMAL(18,2),
    country     VARCHAR(100)
) AS sales
WHERE sale_date >= '2024-01-01';

Create External Tables (persistent views over files)

-- Data source pointing to ADLS
CREATE EXTERNAL DATA SOURCE ds_silver
WITH (LOCATION = 'https://stadatalake.dfs.core.windows.net/silver');

-- Parquet format
CREATE EXTERNAL FILE FORMAT ff_parquet
WITH (FORMAT_TYPE = PARQUET, DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');

-- External table
CREATE EXTERNAL TABLE dbo.ext_sales (
    sale_date DATE, customer_id INT, revenue DECIMAL(18,2), country VARCHAR(100)
)
WITH (
    LOCATION = 'sales/',
    DATA_SOURCE = ds_silver,
    FILE_FORMAT = ff_parquet
);

-- Query it like a normal table
SELECT country, SUM(revenue) AS total
FROM dbo.ext_sales
WHERE sale_date >= '2024-01-01'
GROUP BY country
ORDER BY total DESC;

Access control with credentials

-- Access via the Synapse Managed Identity
CREATE DATABASE SCOPED CREDENTIAL synapse_msi
WITH IDENTITY = 'Managed Identity';

-- Use it in the data source
CREATE EXTERNAL DATA SOURCE ds_silver_secure
WITH (LOCATION = 'https://stadatalake.dfs.core.windows.net/silver',
      CREDENTIAL = synapse_msi);

Conclusion

The Serverless SQL Pool is ideal for ad-hoc Data Lake exploration, creating virtual data marts and exposing data to BI tools without moving it. The pay-per-query model makes it economical for occasional queries over large data volumes.