Mapping Data Flows in ADF: visual ETL transformations without code

João Barros 07 de February de 2025 1 min read

Mapping Data Flows in Azure Data Factory let you build complex ETL transformations through a visual interface, without writing Spark code. Execution runs on a Microsoft-managed Spark cluster, with automatic scaling.

Anatomy of a Data Flow

Source → [Transformations] → Sink

Available transformations:
  Select / Rename     — select and rename columns
  Filter              — filter rows by condition
  Derived Column      — create/modify columns with expressions
  Aggregate           — GROUP BY with aggregate functions
  Join / Lookup       — join datasets
  Conditional Split   — split the flow by condition
  Flatten             — denormalize nested JSON structures
  Window              — window functions (ROW_NUMBER, LAG, etc.)
  Sink                — write destination

Derived Column expression

// Calculate margin and normalize country
margin_pct: toDecimal((revenue - cost) / revenue * 100, 2)
country_norm: upper(trim(country))
load_date:   currentTimestamp()
year_month:  toString(year(sale_date)) + '-' + lpad(toString(month(sale_date)), 2, '0')

Aggregate — GROUP BY with multiple metrics

Group by: country_norm, year_month
Aggregates:
  total_revenue:    sum(revenue)
  num_transactions: count(1)
  avg_ticket:       avg(revenue)
  max_sale:         max(revenue)

Debug and performance

// Enable the debug cluster (takes ~2 min to start)
// Data preview at each transformation — essential to validate
// Optimizations:
  - Source: enable partition → read parallelism
  - Sink: set batch size = 10000 for SQL
  - Avoid Cartesian joins (no join condition)
  - Use Broadcast for small dimensions (< 1GB)

Conclusion

Mapping Data Flows make complex ETL transformations accessible to engineers without Spark experience. For teams with SQL analysts, the expression syntax is familiar. The trade-off is the slow Spark cluster startup — use them for batches, not for frequent micro-batches.