Explaining the History of the Data Lakehouse

Pavol Z. Kutaj - Oct 14 - Dev Community

Overview

This overview traces the brief history of the Data Lakehouse concept, starting with data warehouses, drawing on personal experience and on the influential Databricks paper that continues to shape the industry three years after its publication.
But first, a note: the data "lake" is part of a metaphorical structure that forces us to think of data as something "natural", something given. See a constructivist critique in a great blog post by my boss:

Why we need to stop thinking of data as oil

1980-2010: The Data Warehouse (OLAP) Era

  • Early work focused on data warehouses, such as SAP BW.
  • Data warehouses, invented in the 1980s, promised:
    1. Homogeneous data
    2. Compressed/high-performance storage
    3. Historical data for decision support (reporting)

2010-2020: The Data Lake Revolution

  • Around 2010, a crisis in data warehousing emerged, triggered by the iPhone and cloud computing (the Big Data revolution).
  • Data Lakes emerged as a solution.
  • Storage and compute were decoupled, allowing multiple compute engines to connect to a single storage system (e.g., HDFS + Spark, Presto, Hive).
  • Metaphor: Imagine a vast lake (of data) with a large warehouse nearby, but that warehouse is still a thousand times smaller in capacity than the lake itself.
  • Challenges of Data Warehouse on Lake architecture:
    • Isolated environments
    • Error-prone data transfer from lake to warehouse
    • Slow and labor-intensive processes
    • Data often became stale by the time it reached the warehouse

2020-Present: The Data Lakehouse Era

  • Key developments:
    • Hudi (2017, Uber)
    • Iceberg (2018, Netflix)
    • Delta (2019, Databricks)
  • These formats provide functionality previously exclusive to warehouses:
    • An abstract table layer over files with ACID transactions
    • Support for indexing
    • A transactional, relational model on top of file-based data
  • Data Lakehouse concept:
    • Not just near the lake, but floating on it like an oil rig
    • Directly accesses and utilizes lake data without dedicated connectors
    • Cost-effective and easily replaceable
    • Multiple "houses" can coexist on the same lake
    • Rapid, reliable data access without extra engineering work
    • Immediate value from lake data available in the "house"
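The core trick these table formats share is a transaction log kept alongside the data files: the table is whatever the committed log entries say it is. Below is a minimal, stdlib-only Python sketch of that idea; `ToyTableLog` and its methods are illustrative names for this post, not a real Hudi, Iceberg, or Delta API.

```python
import json
import os
import tempfile


class ToyTableLog:
    """Toy sketch of a Hudi/Iceberg/Delta-style transaction log: each
    commit is a numbered JSON file listing the data files it adds, and
    readers see only files referenced by committed entries."""

    def __init__(self, root: str) -> None:
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _versions(self):
        # Committed entries are the .json files, sorted by version number.
        return sorted(n for n in os.listdir(self.log_dir) if n.endswith(".json"))

    def commit(self, added_files):
        """Atomically record a new table version adding `added_files`."""
        version = len(self._versions())
        entry = os.path.join(self.log_dir, f"{version:06d}.json")
        # Write to a temp file, then rename: rename is atomic on POSIX,
        # so a reader never observes a half-written commit.
        fd, tmp = tempfile.mkstemp(dir=self.log_dir)
        with os.fdopen(fd, "w") as f:
            json.dump({"add": added_files}, f)
        os.rename(tmp, entry)
        return version

    def snapshot(self):
        """Return the list of data files visible in the latest version."""
        files = []
        for name in self._versions():
            with open(os.path.join(self.log_dir, name)) as f:
                files.extend(json.load(f)["add"])
        return files
```

The atomic rename is what makes the commit "ACID-ish": the data files themselves are immutable, and a version either fully appears in the log or does not appear at all. The real formats add schema metadata, deletes, compaction, and concurrency control on top of this same log-of-files idea.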

The Data Lakehouse combines the best features of data warehouses and data lakes, offering a flexible, scalable, and efficient solution for modern data management needs. This concept forms the foundation of products like Databricks' data platform.

For more detailed information, refer to:

  1. CIDR 2021 Paper
  2. Onehouse Blog
  3. YouTube Video