Databricks Workflows: end-to-end data pipeline orchestration

João Barros, 20 October 2025, 2 min read

Databricks Workflows (formerly Jobs) let you orchestrate multi-task pipelines (notebooks, Python scripts, SQL queries, Delta Live Tables) with task dependencies, automatic retries, and notifications.

Anatomy of a Workflow

Job
├── Task: ingest_bronze       (Notebook: 01_ingest)
├── Task: transform_silver    (Notebook: 02_transform, depends on ingest_bronze)
├── Task: aggregate_gold      (Notebook: 03_aggregate, depends on transform_silver)
└── Task: refresh_powerbi     (Python script, depends on aggregate_gold)
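
The dependency chain above is a small directed acyclic graph. As a minimal sketch (not part of any Databricks SDK), the snippet below models it with Python's standard graphlib module, derives one valid run order, and renders each task's upstream tasks in the "depends_on" shape used by the Jobs API. The helper names execution_order and to_depends_on are hypothetical and used only for illustration.

```python
from graphlib import TopologicalSorter

# Task -> set of upstream tasks, mirroring the tree above.
DEPENDENCIES = {
    "ingest_bronze": set(),
    "transform_silver": {"ingest_bronze"},
    "aggregate_gold": {"transform_silver"},
    "refresh_powerbi": {"aggregate_gold"},
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


def to_depends_on(dependencies):
    """Render each task's upstream tasks in the Jobs API "depends_on" shape."""
    return {
        task: [{"task_key": upstream} for upstream in sorted(upstreams)]
        for task, upstreams in dependencies.items()
    }


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
    print(to_depends_on(DEPENDENCIES)["transform_silver"])
```

Checking the graph locally like this catches a circular dependency before the job definition is ever submitted.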

Create via REST API

POST /api/2.1/jobs/create
{
  "name": "Daily_Sales_Pipeline",
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": {"notebook_path": "/Pipelines/01_ingest"},
      "new_cluster": {"spark_version": "15.4.x-scala2.12", "num_workers": 4}
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{"task_key": "ingest_bronze"}],
      "notebook_task": {"notebook_path": "/Pipelines/02_transform"},
      "existing_cluster_id": "{{cluster_id}}"
    }
  ],
  "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "Europe/Lisbon"},
  "email_notifications": {"on_failure": ["dados@bconcepts.pt"]}
}
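
One possible way to submit a payload like this from Python, using only the standard library, is sketched below. It assumes authentication with a personal access token sent as a Bearer header. The workspace URL, the token, and the helper names build_create_request and create_job are placeholders, not values from this article. Depending on the workspace, new_cluster may also need a node_type_id (or an instance pool or cluster policy that supplies one); the sketch passes the job specification through unchanged and only handles transport.

```python
import json
import urllib.request


def build_create_request(host, token, job_spec):
    """Build (but do not send) the POST request for /api/2.1/jobs/create."""
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/create",
        data=json.dumps(job_spec).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumes a personal access token
            "Content-Type": "application/json",
        },
        method="POST",
    )


def create_job(host, token, job_spec):
    """Send the request and return the parsed JSON response body."""
    request = build_create_request(host, token, job_spec)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder host and token: replace with real workspace values.
    req = build_create_request(
        "https://example-workspace.cloud.databricks.com",
        "dapi-placeholder-token",
        {"name": "Daily_Sales_Pipeline", "tasks": []},
    )
    print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the payload easy to inspect or unit-test without touching a real workspace.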

Dynamic parameters

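# In the notebook, read the task parameter (widget values are always strings)
run_date = dbutils.widgets.get("run_date")

# In the task definition, pass parameters inside notebook_task
"notebook_task": {
  "notebook_path": "/Pipelines/01_ingest",
  "base_parameters": {"run_date": "{{job.start_time.iso_date}}"}
}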
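
Because widget values arrive as strings, it can help to validate the parameter before using it. The sketch below assumes the value is a YYYY-MM-DD date, which is what the iso_date reference is expected to produce. The function name parse_run_date is hypothetical.

```python
from datetime import date


def parse_run_date(raw):
    """Parse an ISO date string; fail loudly on anything else."""
    try:
        return date.fromisoformat(raw.strip())
    except ValueError as exc:
        # Also catches a template that was never resolved, e.g. "{{...}}".
        raise ValueError(f"run_date must be an ISO date, got {raw!r}") from exc


if __name__ == "__main__":
    # Inside a notebook this would be:
    #   run_date = parse_run_date(dbutils.widgets.get("run_date"))
    print(parse_run_date("2025-10-20"))
```

Failing early on a malformed value keeps a bad parameter from silently selecting the wrong partition further down the pipeline.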

Retry and timeouts

"max_retries": 2,                     // retry a failed task up to 2 times
"min_retry_interval_millis": 300000,  // wait 5 minutes between retries
"timeout_seconds": 3600               // fail the task if a run takes more than 1h
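
These settings interact, so it can be useful to estimate how long a task might block downstream work. The sketch below applies the three fields to a task dictionary and computes a rough upper bound, assuming the timeout applies to each attempt separately. Both helper names are hypothetical. It treats a max_retries of -1 (retry indefinitely) and a timeout_seconds of 0 (no timeout) as unbounded.

```python
def with_retry_policy(task, max_retries=2, retry_interval_s=300, timeout_s=3600):
    """Return a copy of a task dict with the retry and timeout fields set."""
    return {
        **task,
        "max_retries": max_retries,
        "min_retry_interval_millis": retry_interval_s * 1000,
        "timeout_seconds": timeout_s,
    }


def worst_case_seconds(task):
    """Rough upper bound on task duration, or None if it is unbounded.

    Assumes the timeout is applied per attempt (an assumption of this sketch).
    """
    max_retries = task.get("max_retries", 0)
    timeout_s = task.get("timeout_seconds", 0)
    if max_retries < 0 or timeout_s == 0:
        return None  # infinite retries or no timeout: no finite bound
    wait_s = task.get("min_retry_interval_millis", 0) // 1000
    return (max_retries + 1) * timeout_s + max_retries * wait_s


if __name__ == "__main__":
    task = with_retry_policy({"task_key": "ingest_bronze"})
    print(worst_case_seconds(task))  # 3 * 3600 + 2 * 300 = 11400
```

With the values shown above, a task could occupy up to about 3 hours 10 minutes under that assumption, which is worth comparing against the daily 06:00 schedule. By default a run that times out may not be retried at all unless the task also sets retry_on_timeout.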

Conclusion

Databricks Workflows remove the need for an external orchestrator when the whole pipeline runs on Databricks. For more complex cross-platform cases, combine them with Azure Data Factory, which can trigger jobs by calling the Databricks REST API.