Delta Lake in Databricks: ACID transactions for data at scale
João Barros
8 August 2024
2 min read
Delta Lake is an open-source storage layer that adds transactional reliability to Parquet files in a data lake. In Databricks it is the default table format, bringing ACID guarantees that plain Parquet does not offer.
What Delta Lake brings
- ACID Transactions: every write commits atomically, so the table stays consistent even if a job fails midway.
- Schema Enforcement: rejects writes whose schema is incompatible with the table, preventing silent corruption.
- Time Travel: read previous table versions by timestamp or version number.
- Upserts with MERGE: insert, update, and delete in a single atomic statement.
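To make the atomicity guarantee concrete: Delta Lake records each change as a numbered commit file in a `_delta_log` directory next to the data, and a commit counts only if its file is the first to claim that number. The toy below imitates that rule with plain local files. It is an illustration under simplifying assumptions, not Delta's actual implementation, and `try_commit` is a hypothetical helper.

```python
import json
import os
import tempfile


def try_commit(log_dir: str, version: int, actions: list) -> bool:
    """Try to publish `actions` as commit `version`.

    Returns False if another writer already claimed that version.
    """
    # Commit files are named with a zero-padded 20-digit version number.
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file exists, so only one writer wins.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")  # one JSON action per line
    return True


log_dir = tempfile.mkdtemp()
first = try_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
second = try_commit(log_dir, 0, [{"add": {"path": "part-0001.parquet"}}])
```

Here the second writer loses the race for version 0 and would have to retry as version 1. Readers only ever see data files referenced by a completed commit, which is why a job that dies midway leaves no partial rows behind.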
Create and write a Delta table
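# PySpark: write the DataFrame df in Delta format, replacing any data already at this path
df.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/sales")
-- SQL: register the data at that path as a table named sales
CREATE TABLE sales USING DELTA LOCATION '/mnt/datalake/silver/sales'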
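Schema enforcement matters most when appending. The sketch below assumes `df` is a Spark DataFrame and wraps the append in a hypothetical helper, `append_sales`. By default, a mismatched schema makes the write fail. Setting the `mergeSchema` option opts in to adding new columns instead.

```python
def append_sales(df, path: str, allow_new_columns: bool = False) -> None:
    """Append df to the Delta table stored at path.

    With allow_new_columns=False, Delta rejects a DataFrame whose schema
    does not match the table. With True, new columns are added to the table.
    """
    writer = df.write.format("delta").mode("append")
    if allow_new_columns:
        # Opt in to schema evolution for this write only
        writer = writer.option("mergeSchema", "true")
    writer.save(path)
```

A call such as `append_sales(df_new, "/mnt/datalake/silver/sales")` keeps the strict default. Passing `allow_new_columns=True` is a deliberate choice for loads where the source is expected to gain columns.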
Time Travel
# By version number
df_v3 = spark.read.format("delta").option("versionAsOf", 3).load("/mnt/datalake/silver/sales")
# By timestamp
df_jan = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/mnt/datalake/silver/sales")
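Reading an old version is only half of error recovery. You also need to find the right version and, sometimes, roll the table back to it. A minimal sketch, assuming the table is registered as `sales` and `spark` is an active session; `restore_version` is a hypothetical helper built on the `DESCRIBE HISTORY` and `RESTORE` SQL commands.

```python
def restore_version(spark, table: str, version: int):
    """Fetch the table's commit history, then roll the table back to `version`."""
    # DESCRIBE HISTORY returns one row per commit (version, timestamp, operation, ...)
    history = spark.sql(f"DESCRIBE HISTORY {table}")
    # RESTORE is itself recorded as a new commit, so the rollback stays auditable
    spark.sql(f"RESTORE TABLE {table} TO VERSION AS OF {version}")
    return history
```

For example, `restore_version(spark, "sales", 3)` returns the history DataFrame for inspection and leaves the table with the contents of version 3.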
MERGE (Upsert)
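from delta.tables import DeltaTable
# Load the existing Delta table as the merge target
delta_tbl = DeltaTable.forPath(spark, "/mnt/datalake/silver/sales")
# df_new holds the incoming rows; match them to existing rows on sale_id
(
    delta_tbl.alias("tgt")
    .merge(df_new.alias("src"), "tgt.sale_id = src.sale_id")
    .whenMatchedUpdateAll()      # update rows that already exist
    .whenNotMatchedInsertAll()   # insert rows that are new
    .execute()
)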
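The same builder can also delete rows in the same atomic statement. The sketch below is one possible shape, not a prescribed pattern. It assumes the incoming data carries a boolean column named `is_deleted`, which is hypothetical and not part of the original example, and it deduplicates on `sale_id` first because MERGE fails when several source rows match the same target row.

```python
def upsert_with_deletes(delta_tbl, df_new) -> None:
    """Apply inserts, updates, and deletes from df_new in one MERGE."""
    # Keep one (arbitrary) row per key; choose a smarter rule if order matters
    deduped = df_new.dropDuplicates(["sale_id"])
    (
        delta_tbl.alias("tgt")
        .merge(deduped.alias("src"), "tgt.sale_id = src.sale_id")
        # Matched clauses are evaluated in order, so the delete comes first
        .whenMatchedDelete(condition="src.is_deleted = true")
        .whenMatchedUpdateAll()
        # Do not insert rows that arrive already flagged as deleted
        .whenNotMatchedInsertAll(condition="src.is_deleted = false")
        .execute()
    )
```

It would be called as `upsert_with_deletes(delta_tbl, df_new)` with a `DeltaTable` obtained from `DeltaTable.forPath`.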
Optimize and vacuum
OPTIMIZE sales ZORDER BY (sale_date, customer_id)
VACUUM sales RETAIN 168 HOURS -- keep 7 days of history
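The retention window is given in hours, and VACUUM permanently deletes files, so Time Travel to versions older than the window may stop working. A small sketch that derives the statement from a number of days and defaults to `DRY RUN`, which only lists the files that would be removed; `vacuum_statement` is a hypothetical helper.

```python
def vacuum_statement(table: str, retain_days: int, dry_run: bool = True) -> str:
    """Build a VACUUM statement that keeps `retain_days` days of history."""
    hours = retain_days * 24
    stmt = f"VACUUM {table} RETAIN {hours} HOURS"
    if dry_run:
        # DRY RUN lists the files that would be deleted without deleting them
        stmt += " DRY RUN"
    return stmt
```

Running `spark.sql(vacuum_statement("sales", 7))` previews the cleanup. Passing `dry_run=False` produces the same statement as the one above.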
Conclusion
Delta Lake is a common foundation for the modern data lakehouse. In Databricks it is natively integrated and enabled by default, so take advantage of MERGE for incremental ingestion and Time Travel for audits and error recovery.