Serverless SQL in Synapse: query the Data Lake without infrastructure
João Barros
26 de September de 2024
1 min read
The Serverless SQL Pool in Azure Synapse Analytics is a query engine that lets you query files on Azure Data Lake Storage with standard T-SQL, without configuring infrastructure. You only pay for the queries executed (TB processed).
Query Parquet directly
-- Read a Parquet file from ADLS
SELECT TOP 100 *
FROM OPENROWSET(
BULK 'https://stadatalake.dfs.core.windows.net/silver/sales/**',
FORMAT = 'PARQUET'
) AS [result];
-- With an explicit schema for better performance
SELECT *
FROM OPENROWSET(
BULK 'https://stadatalake.dfs.core.windows.net/silver/sales/',
FORMAT = 'PARQUET'
)
WITH (
sale_date DATE,
customer_id INT,
revenue DECIMAL(18,2),
country VARCHAR(100)
) AS sales
WHERE sale_date >= '2024-01-01';
Create External Tables (persistent views over files)
-- Data source pointing to ADLS
CREATE EXTERNAL DATA SOURCE ds_silver
WITH (LOCATION = 'https://stadatalake.dfs.core.windows.net/silver');
-- Parquet format
CREATE EXTERNAL FILE FORMAT ff_parquet
WITH (FORMAT_TYPE = PARQUET, DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec');
-- External table
CREATE EXTERNAL TABLE dbo.ext_sales (
sale_date DATE, customer_id INT, revenue DECIMAL(18,2), country VARCHAR(100)
)
WITH (
LOCATION = 'sales/',
DATA_SOURCE = ds_silver,
FILE_FORMAT = ff_parquet
);
-- Query it like a normal table
SELECT country, SUM(revenue) AS total
FROM dbo.ext_sales
WHERE sale_date >= '2024-01-01'
GROUP BY country
ORDER BY total DESC;
Access control with credentials
-- Access via the Synapse Managed Identity
CREATE DATABASE SCOPED CREDENTIAL synapse_msi
WITH IDENTITY = 'Managed Identity';
-- Use it in the data source
CREATE EXTERNAL DATA SOURCE ds_silver_secure
WITH (LOCATION = 'https://stadatalake.dfs.core.windows.net/silver',
CREDENTIAL = synapse_msi);
Conclusion
The Serverless SQL Pool is ideal for ad-hoc Data Lake exploration, creating virtual data marts and exposing data to BI tools without moving it. The pay-per-query model makes it economical for occasional queries over large data volumes.