Data Lake and Lakehouse

A data lake stores data of any type without enforcing a schema at write time; structure is applied when the data is read (schema-on-read). A lakehouse combines the storage flexibility of a data lake with data warehouse (DWH) capabilities such as ACID transactions and SQL access.
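To make schema-on-read concrete, here is a minimal sketch in plain Python. The sample events, the `READ_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names introduced for illustration; a real lake would typically do this in a query engine rather than by hand.

```python
import io
import json

# Raw events as they might land in the lake: nothing validated them at write
# time, so the records differ in shape (hypothetical sample data).
RAW_EVENTS = io.StringIO(
    '{"user_id": "1", "amount": "19.90", "country": "DE"}\n'
    '{"user_id": "2", "amount": "5"}\n'
    '{"user_id": "3", "note": "no amount recorded"}\n'
)

# The schema belongs to the reader: field name -> (type converter, default).
READ_SCHEMA = {
    "user_id": (int, None),
    "amount": (float, 0.0),
    "country": (str, "unknown"),
}


def read_with_schema(lines, schema):
    """Parse JSON lines and project each record onto the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for name, (convert, default) in schema.items():
            # Missing fields fall back to the default; unknown fields are ignored.
            row[name] = convert(record[name]) if name in record else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, READ_SCHEMA)
```

A different consumer could read the same raw lines with a different schema, which is the flexibility (and the governance risk) that the rest of this page describes.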

Data Lake

What It Stores

Structured: Database tables, CSV
Semi-structured: JSON, XML, Avro
Unstructured: Logs, video, images, text

Advantages

Scalable (add nodes horizontally)
Economical (built on low-cost storage: open-source Hadoop/HDFS on commodity hardware, or cloud object storage such as S3)
Universal (all data types in one system)
Fast hypothesis testing (no upfront schema design)

Disadvantages

Low data quality without governance controls
Can become a "data swamp" without cataloging
Difficulty determining data value
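As a rough illustration of what cataloging buys you, the sketch below compares the paths present in a lake against a small in-memory catalog. `CatalogEntry`, `find_uncataloged`, and the sample paths are hypothetical; a real deployment would use a dedicated metadata catalog service rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Minimal metadata a dataset needs to stay discoverable."""
    path: str
    owner: str
    description: str
    columns: dict = field(default_factory=dict)  # column name -> type name


def find_uncataloged(lake_paths, catalog):
    """Return lake paths that have no catalog entry, sorted for stable output."""
    known = {entry.path for entry in catalog}
    return sorted(p for p in lake_paths if p not in known)


catalog = [
    CatalogEntry(
        path="raw/orders/",
        owner="sales-team",
        description="Order events",
        columns={"order_id": "string", "amount": "double"},
    ),
]
lake_paths = ["raw/orders/", "raw/clicks/", "tmp/export_final_v2/"]

# Datasets nobody has documented: the first candidates for a "data swamp".
uncataloged = find_uncataloged(lake_paths, catalog)
```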

Modern Pattern

Data Lake collects all raw data
DWH (or Lakehouse) stores processed analytical data
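The two-tier pattern above can be sketched as a small pipeline. This is only an illustration: the raw records are hypothetical, and an in-memory SQLite database stands in for the DWH or lakehouse table, which it is not in any real deployment.

```python
import json
import sqlite3

# Raw zone: events kept exactly as received (hypothetical sample data).
RAW_ZONE = [
    '{"order_id": "a1", "country": "DE", "amount": 20.0}',
    '{"order_id": "a2", "country": "DE", "amount": 5.5}',
    '{"order_id": "a3", "country": "FR", "amount": 12.0}',
    '{"order_id": "a4", "country": "FR"}',  # incomplete: amount is missing
]

REQUIRED_FIELDS = {"order_id", "country", "amount"}


def load_processed(raw_lines, conn):
    """Validate raw records and load the clean ones into a typed table.

    Returns the number of rejected records.
    """
    conn.execute(
        "CREATE TABLE orders ("
        "order_id TEXT PRIMARY KEY, country TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rejected = 0
    for line in raw_lines:
        record = json.loads(line)
        if not REQUIRED_FIELDS <= set(record):
            rejected += 1  # the record stays in the raw zone for later inspection
            continue
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (record["order_id"], record["country"], float(record["amount"])),
        )
    conn.commit()
    return rejected


conn = sqlite3.connect(":memory:")  # stand-in for the processed layer
rejected = load_processed(RAW_ZONE, conn)
revenue = dict(
    conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country")
)
```

The raw zone keeps everything, including the incomplete record, while the processed layer only accepts rows that satisfy the declared schema.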

Data Lakehouse

Combines Data Lake storage with DWH capabilities:

Metadata catalogs and schemas
ACID transaction support
SQL access to data
Optimized for both BI and ML workloads

Open Table Formats

Technology	Key Features
Delta Lake	ACID transactions, time travel, schema evolution, Z-ordering
Apache Iceberg	Hidden partitioning, partition evolution, snapshot isolation, vendor-neutral
Apache Hudi	Upserts, incremental processing, record-level changes, compaction

All three provide:

ACID semantics on object storage (S3, GCS, ADLS)
Schema evolution without rewriting data
Time travel / snapshot queries
Metadata management for query optimization
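One way to picture how a table format gets atomic commits and time travel on top of immutable files is an append-only commit log. The toy below is a conceptual sketch only: `ToyTable` and its methods are hypothetical, and it does not reproduce the on-disk layout of Delta Lake, Iceberg, or Hudi, which store much richer metadata and handle concurrent writers.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Append-only commit log; each commit lists data files added and removed."""
    log: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Record one atomic change and return its version number."""
        self.log.append({"add": tuple(add), "remove": tuple(remove)})
        return len(self.log) - 1

    def files_at(self, version=None):
        """Replay the log up to `version` to get the files in that snapshot."""
        if version is None:
            version = len(self.log) - 1
        files = set()
        for entry in self.log[: version + 1]:
            files -= set(entry["remove"])
            files |= set(entry["add"])
        return files


table = ToyTable()
v0 = table.commit(add=["part-0.parquet", "part-1.parquet"])
# An update rewrites part-1 into part-2 in a single commit, so a reader sees
# either the old file set or the new one, never a mix of both.
v1 = table.commit(add=["part-2.parquet"], remove=["part-1.parquet"])

current = table.files_at()      # latest snapshot
as_of_v0 = table.files_at(v0)   # "time travel" to the first version
```

Note that `part-1.parquet` must remain in storage for the older version to stay readable, which is why time travel carries a storage cost.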

Key Facts

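S3 is commonly estimated to be 5-10x cheaper than HDFS for storage, partly because HDFS keeps three replicas of each block by default on always-on cluster nodes; the exact ratio depends on pricing and workload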
Lakehouse enables running BI and ML on the same data without copying
Separation of storage and compute allows independent scaling
Implementations: Databricks (Delta Lake), Snowflake, Apache Iceberg on Spark/Trino

Gotchas

A Data Lake without governance becomes a "data swamp" - always catalog and document datasets
Table formats require a query engine that understands them (Spark, Trino, Flink)
Time travel has a storage cost - old snapshots must be expired periodically (for example, VACUUM in Delta Lake or expire_snapshots in Iceberg)
Schema evolution does not mean schema-less - define schemas for discoverability
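The snapshot-cleanup gotcha can be illustrated with a simple retention rule. This is a sketch of the policy only, under the assumption that snapshots are plain `(snapshot_id, created_at)` tuples; the `expire_snapshots` function here is a hypothetical helper, not the API of any table format, and a real cleanup also has to delete the data files that no retained snapshot references.

```python
from datetime import datetime, timedelta, timezone


def expire_snapshots(snapshots, now, retention, keep_last=1):
    """Split snapshots into (kept, expired) by age.

    The newest `keep_last` snapshots are always kept, so the table never
    loses its current state even if nothing was written recently.
    """
    ordered = sorted(snapshots, key=lambda s: s[1], reverse=True)
    kept, expired = [], []
    for index, snapshot in enumerate(ordered):
        too_old = now - snapshot[1] > retention
        if index < keep_last or not too_old:
            kept.append(snapshot)
        else:
            expired.append(snapshot)
    return kept, expired


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
snapshots = [
    ("s1", now - timedelta(days=40)),
    ("s2", now - timedelta(days=10)),
    ("s3", now - timedelta(days=1)),
]
kept, expired = expire_snapshots(snapshots, now, retention=timedelta(days=7))
```

The retention window is a trade-off: a longer window allows time travel further back, while a shorter one frees storage sooner.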