Mapping Data Flows in ADF: visual ETL transformations without code
João Barros
07 de February de 2025
1 min read
Mapping Data Flows in Azure Data Factory let you build complex ETL transformations through a visual interface, without writing Spark code. Execution runs on a Microsoft-managed Spark cluster, with automatic scaling.
Anatomy of a Data Flow
Source → [Transformations] → Sink
Available transformations:
Select / Rename — select and rename columns
Filter — filter rows by condition
Derived Column — create/modify columns with expressions
Aggregate — GROUP BY with aggregate functions
Join / Lookup — join datasets
Conditional Split — split the flow by condition
Flatten — denormalize nested JSON structures
Window — window functions (ROW_NUMBER, LAG, etc.)
Sink — write destination
Derived Column expression
// Calculate margin and normalize country
margin_pct: toDecimal((revenue - cost) / revenue * 100, 2)
country_norm: upper(trim(country))
load_date: currentTimestamp()
year_month: toString(year(sale_date)) + '-' + lpad(toString(month(sale_date)), 2, '0')
Aggregate — GROUP BY with multiple metrics
Group by: country_norm, year_month
Aggregates:
total_revenue: sum(revenue)
num_transactions: count(1)
avg_ticket: avg(revenue)
max_sale: max(revenue)
Debug and performance
// Enable the debug cluster (takes ~2 min to start)
// Data preview at each transformation — essential to validate
// Optimizations:
- Source: enable partition → read parallelism
- Sink: set batch size = 10000 for SQL
- Avoid Cartesian joins (no join condition)
- Use Broadcast for small dimensions (< 1GB)
Conclusion
Mapping Data Flows make complex ETL transformations accessible to engineers without Spark experience. For teams with SQL analysts, the expression syntax is familiar. The trade-off is the slow Spark cluster startup — use them for batches, not for frequent micro-batches.