What a Lakehouse is and how it combines a data lake and a warehouse

João Barros · 17 October 2023 · 2 min read

The term Lakehouse describes an architecture that joins the flexibility of a data lake with the reliability of a data warehouse. Instead of choosing between the two, you get the benefits of both in one place.

Prerequisites

A basic understanding of data lakes (files in storage) and data warehouses (SQL tables).
Familiarity with file formats such as Parquet.
A platform with Delta tables (Databricks, Microsoft Fabric or Spark with Delta Lake).

Step 1: Understand the problem it solves

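A data lake stores everything cheaply but gives no guarantees: no transactions and no enforced schema. A warehouse is reliable but rigid and expensive. The Lakehouse adds a table layer (Delta Lake in this article) on top of the files. Delta Lake keeps the data as Parquet files and records every change in a transaction log, which is what brings transactions and schema enforcement to the lake.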
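
To make the idea of a table layer concrete, here is a toy model in plain Python. It is not Delta Lake's real format or API: the `ToyTable` class, the `_log` folder and the JSON files are all assumptions made for illustration. It sketches two ideas from this step: a batch is checked against the schema before anything is written, and a batch only becomes visible once its commit entry exists in the log.

```python
import json
import os
import tempfile
import uuid


class ToyTable:
    """Toy table layer: data files plus an append-only commit log (illustration only)."""

    def __init__(self, root, schema):
        self.root = root
        self.schema = schema  # column name -> Python type
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Each commit is a file named "<version>.json" in the log folder.
        return sorted(int(name.split(".")[0]) for name in os.listdir(self.log_dir))

    def append(self, rows):
        # Schema enforcement: reject the whole batch before writing anything.
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"unexpected columns: {sorted(row)}")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise ValueError(f"column {col!r} expects {typ.__name__}")
        version = len(self._versions())
        data_file = f"part-{uuid.uuid4().hex}.json"
        with open(os.path.join(self.root, data_file), "w") as f:
            json.dump(rows, f)
        # Mode "x" fails if this version was already committed by another writer,
        # so the data file stays invisible unless the commit succeeds.
        with open(os.path.join(self.log_dir, f"{version}.json"), "x") as f:
            json.dump({"add": data_file}, f)
        return version

    def read(self, version=None):
        # Readers only follow the log, optionally stopping at an older version.
        versions = self._versions()
        if version is not None:
            versions = [v for v in versions if v <= version]
        rows = []
        for v in versions:
            with open(os.path.join(self.log_dir, f"{v}.json")) as f:
                entry = json.load(f)
            with open(os.path.join(self.root, entry["add"])) as f:
                rows.extend(json.load(f))
        return rows


table = ToyTable(tempfile.mkdtemp(), {"categoria": str, "valor": float})
table.append([{"categoria": "livros", "valor": 10.0}])  # version 0
table.append([{"categoria": "jogos", "valor": 25.0}])   # version 1
try:
    table.append([{"categoria": "jogos", "valor": "caro"}])  # wrong type
except ValueError as err:
    print("rejected:", err)
print(len(table.read()))           # 2
print(len(table.read(version=0)))  # 1
```

The rejected batch leaves no trace in the table, and reading at version 0 returns only the first batch. This is the same intuition behind the transactions and time travel used in the later steps.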

Step 2: Store data in table format

Instead of writing loose CSV files, write the data in Delta format (here df is an existing Spark DataFrame):

df.write.format("delta").save("/dados/vendas")
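
In practice you usually also choose a save mode and sometimes a partition column. The helper below is a minimal sketch of that. The function name `save_as_delta` and its parameters are hypothetical, and it assumes `df` is a Spark DataFrame created in a session where Delta Lake is configured.

```python
def save_as_delta(df, path, mode="append", partition_col=None):
    """Write a Spark DataFrame to `path` in Delta format (hypothetical helper)."""
    # Spark save modes include "append", "overwrite", "ignore" and "errorifexists".
    writer = df.write.format("delta").mode(mode)
    if partition_col is not None:
        # Optional: lay the files out by the values of one column.
        writer = writer.partitionBy(partition_col)
    writer.save(path)


# Example call, assuming `df` has the columns categoria and valor:
# save_as_delta(df, "/dados/vendas", mode="overwrite")
```

Each call that succeeds is recorded as one commit, so it creates a new version of the table.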

Step 3: Query with SQL

SELECT categoria, SUM(valor) AS total
FROM delta.`/dados/vendas`
GROUP BY categoria;

Step 4: Take advantage of transactions and history

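Delta tables support ACID transactions and time travel, which means querying an earlier version of the data. Each commit creates a new version, numbered from 0, so the query below needs a table with at least four commits: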

SELECT categoria, valor
FROM delta.`/dados/vendas` VERSION AS OF 3;
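
The same time travel read is available from the DataFrame reader through the `versionAsOf` and `timestampAsOf` options. The sketch below wraps them in a hypothetical helper, `read_snapshot`, and assumes `spark` is a SparkSession with Delta Lake configured.

```python
def read_snapshot(spark, path, version=None, timestamp=None):
    """Load a Delta table as of a version or a timestamp (hypothetical helper)."""
    reader = spark.read.format("delta")
    if version is not None:
        reader = reader.option("versionAsOf", version)
    elif timestamp is not None:
        # A timestamp string such as "2023-10-17 00:00:00".
        reader = reader.option("timestampAsOf", timestamp)
    # With neither argument, this loads the latest version.
    return reader.load(path)


# Example call, equivalent to the SQL query above:
# read_snapshot(spark, "/dados/vendas", version=3).select("categoria", "valor").show()
```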

Verify the result

Describe the table history and confirm you can see the versions:

DESCRIBE HISTORY delta.`/dados/vendas`;
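
If you want to run this check from a notebook or a job, one option is to collect the version numbers from the history and test for the one you plan to query. This is a minimal sketch: the helper name `available_versions` is hypothetical, and `spark` is assumed to be a SparkSession with Delta Lake configured.

```python
def available_versions(spark, path):
    """Return the version numbers listed in the table history (hypothetical helper)."""
    history = spark.sql(f"DESCRIBE HISTORY delta.`{path}`")
    # The history output has one row per commit, with a `version` column.
    return [row["version"] for row in history.select("version").collect()]


# Example check before a time travel query on version 3:
# assert 3 in available_versions(spark, "/dados/vendas")
```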

Conclusion

With a Lakehouse you no longer need to copy data between a lake and a warehouse: the same table layer serves engineering, analytics and BI. Which part of your architecture would you simplify with a Lakehouse?