Databricks Workflows: end-to-end data pipeline orchestration
João Barros
20 October 2025
Databricks Workflows (formerly Jobs) lets you orchestrate multi-task pipelines (notebooks, Python scripts, SQL queries, Delta Live Tables) with task dependencies, automatic retries and notifications.
Anatomy of a Workflow
Job
├── Task: ingest_bronze (Notebook: 01_ingest)
├── Task: transform_silver (Notebook: 02_transform, depends on ingest_bronze)
├── Task: aggregate_gold (Notebook: 03_aggregate, depends on transform_silver)
└── Task: refresh_powerbi (Python script, depends on aggregate_gold)
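The dependency chain above is a small directed acyclic graph, and the order in which the tasks can run follows from it. A minimal sketch using only the Python standard library is shown here; the `DEPENDENCIES` mapping and the `execution_order` helper are illustrative names, not part of any Databricks API.

```python
from graphlib import TopologicalSorter

# Task -> set of tasks it depends on, mirroring the tree above.
DEPENDENCIES = {
    "ingest_bronze": set(),
    "transform_silver": {"ingest_bronze"},
    "aggregate_gold": {"transform_silver"},
    "refresh_powerbi": {"aggregate_gold"},
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
```

Because this pipeline is a straight chain, there is only one valid order. With branching dependencies, several orders would be valid and independent tasks could run in parallel.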
Create via REST API
POST /api/2.1/jobs/create
{
  "name": "Daily_Sales_Pipeline",
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": {"notebook_path": "/Pipelines/01_ingest"},
      "new_cluster": {"spark_version": "15.4.x-scala2.12", "num_workers": 4}
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{"task_key": "ingest_bronze"}],
      "notebook_task": {"notebook_path": "/Pipelines/02_transform"},
      "existing_cluster_id": "{{cluster_id}}"
    }
  ],
  "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "Europe/Lisbon"},
  "email_notifications": {"on_failure": ["dados@bconcepts.pt"]}
}
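The same request can be sent from a script. The sketch below assumes a workspace URL and a personal access token passed as a bearer token; `build_job_payload`, `create_job`, `host` and `token` are hypothetical names chosen for illustration. A real `new_cluster` spec usually also needs a node type or an instance pool, which is left out here as in the payload above.

```python
import json
import urllib.request


def build_job_payload(existing_cluster_id):
    """Build the job definition shown above as a plain dict."""
    return {
        "name": "Daily_Sales_Pipeline",
        "tasks": [
            {
                "task_key": "ingest_bronze",
                "notebook_task": {"notebook_path": "/Pipelines/01_ingest"},
                # Cluster spec copied from the article; likely incomplete
                # for a real workspace (no node type or instance pool).
                "new_cluster": {
                    "spark_version": "15.4.x-scala2.12",
                    "num_workers": 4,
                },
            },
            {
                "task_key": "transform_silver",
                "depends_on": [{"task_key": "ingest_bronze"}],
                "notebook_task": {"notebook_path": "/Pipelines/02_transform"},
                "existing_cluster_id": existing_cluster_id,
            },
        ],
        "schedule": {
            "quartz_cron_expression": "0 0 6 * * ?",
            "timezone_id": "Europe/Lisbon",
        },
        "email_notifications": {"on_failure": ["dados@bconcepts.pt"]},
    }


def create_job(host, token, payload):
    """POST the payload to /api/2.1/jobs/create and return the new job_id."""
    request = urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["job_id"]
```

Keeping the payload builder separate from the HTTP call makes it possible to validate or version the job definition without touching a workspace.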
Dynamic parameters
# In the notebook, read the task parameter (widget values are returned as strings)
run_date = dbutils.widgets.get("run_date")
# In the task definition, pass the parameter inside "notebook_task"
"base_parameters": {"run_date": "{{job.start_time.iso_date}}"}
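Since the widget value arrives as a string, it can help to validate it before using it in a query or a path. The sketch below assumes the parameter is formatted as YYYY-MM-DD; `parse_run_date` is a hypothetical helper, not a Databricks function.

```python
from datetime import date, datetime


def parse_run_date(raw):
    """Convert a run_date string in %Y-%m-%d form into a datetime.date.

    Raises ValueError with a descriptive message for anything else,
    including an unresolved template such as "{{job.start_time.iso_date}}".
    """
    try:
        return datetime.strptime(raw, "%Y-%m-%d").date()
    except ValueError as exc:
        raise ValueError(f"run_date must look like YYYY-MM-DD, got {raw!r}") from exc


# In a notebook (dbutils is only available on Databricks):
# run_date = parse_run_date(dbutils.widgets.get("run_date"))

if __name__ == "__main__":
    print(parse_run_date("2025-10-20"))
```

Failing early with a clear message makes a misconfigured parameter easier to spot than a downstream query that silently returns no rows.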
Retry and timeouts
"max_retries": 2,
"min_retry_interval_millis": 300000, // 5 minutes between retries
"timeout_seconds": 3600 // fail if it takes more than 1h
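These fields are set per task, so a job with many tasks repeats them. One possible way to keep them consistent is to apply a default policy when building the payload in code. In the sketch below, `with_retry_policy` and its minute and hour arguments are hypothetical conveniences; only the three resulting keys come from the settings above.

```python
import copy


def with_retry_policy(tasks, max_retries=2, retry_interval_minutes=5, timeout_hours=1):
    """Return a copy of the task list with retry and timeout defaults filled in.

    Values already present on a task are kept, so individual tasks can
    still override the shared policy.
    """
    patched = copy.deepcopy(tasks)
    for task in patched:
        task.setdefault("max_retries", max_retries)
        task.setdefault("min_retry_interval_millis", retry_interval_minutes * 60 * 1000)
        task.setdefault("timeout_seconds", timeout_hours * 3600)
    return patched


if __name__ == "__main__":
    tasks = [
        {"task_key": "ingest_bronze"},
        {"task_key": "transform_silver", "max_retries": 0},
    ]
    for task in with_retry_policy(tasks):
        print(task)
```

The input list is deep-copied, so the original task definitions stay unchanged.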
Conclusion
Databricks Workflows can remove the need for an external orchestration tool when a pipeline runs entirely on Databricks. For more complex cross-platform cases, combine them with Azure Data Factory, which can trigger jobs by calling the Databricks REST API.
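As an illustration of what an external orchestrator would send, the sketch below builds a request for the Jobs API `run-now` endpoint without sending it. It is written in Python for consistency with the rest of the article; `build_run_now_request` and its arguments are hypothetical names, and the job ID and token are assumed to come from your own workspace.

```python
import json
import urllib.request


def build_run_now_request(host, token, job_id, run_date):
    """Build (but do not send) a POST request that triggers an existing job."""
    body = {
        "job_id": job_id,
        # Overrides the notebook parameter for this run only.
        "notebook_params": {"run_date": run_date},
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it: urllib.request.urlopen(build_run_now_request(...))
```

The same URL, headers and body can be mirrored in whatever HTTP step the external tool provides.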