Skip to content

Data Warehouse Architecture

A data warehouse (DWH) is a subject-oriented, integrated, non-volatile, time-variant collection of data organized to support decision-making (Bill Inmon). DWH exists because OLTP and OLAP workloads are fundamentally incompatible in one system.

OLTP vs OLAP

Characteristic OLTP OLAP
Purpose Transaction processing Analytical queries
Queries Simple, frequent, identical Complex, infrequent, ad-hoc
Users Many concurrent Few analysts
Data Current, operational Historical, aggregated
Operations Read/write (CRUD) Mostly read
Latency Micro/milliseconds Seconds to minutes
Examples PostgreSQL, MySQL (per-service) Snowflake, ClickHouse, Greenplum

DWH Properties (Inmon)

Property Meaning
Subject-oriented Covers specific business domains; irrelevant data excluded
Integrated No contradictions between data from different sources
Non-volatile Data only appended, never modified or deleted
Time-variant Maintains maximum historical depth

DWH Layers

Sources -> Staging Area -> Central Warehouse -> Data Marts -> BI/Analytics
Layer Purpose
Staging Area Raw data landing, cleansing, normalization before loading
MDM (Master Data) Reference/lookup data ensuring consistency
Central Warehouse Historical, subject-oriented, integrated data
Data Marts / Cubes Specialized views for specific analytical tasks

Design Approaches

Kimball (Bottom-Up)

  • DWH = collection of data marts
  • Design marts per consumer need (star schema)
  • Fast to deliver business value
  • Drawbacks: marts may define metrics differently; hard to standardize

Inmon (Top-Down)

  • Build enterprise DWH first (3NF), then derive marts
  • Single source of truth, better consistency
  • Drawbacks: longer initial development, more complex

Data Vault 2.0

  • Handles many sources with different schemas
  • Append-only, full auditability
  • Business rules isolated in mart layer
  • See data vault for details

Platform Evolution

Gen 1: RDBMS DWH (Oracle, DB2) - 1990s
  -> Rigid schema, ETL, batch only

Gen 2: Hadoop ecosystem (HDFS + MapReduce) - 2007+
  -> Data Lake, schema-on-read, horizontal scale

Gen 3: Cloud DWH + Lakehouse - 2015+
  -> Snowflake, Databricks, BigQuery
  -> ELT, auto-scaling, separation of compute/storage

Gen 4: Data Mesh / Data Products - 2020+
  -> Domain-owned data, self-serve platform, federated governance

Choosing Architecture

Scenario Recommended
Small company, structured data, BI focus Cloud DWH (Snowflake, BigQuery)
Mixed data types, ML workloads Data Lakehouse (Databricks, Iceberg)
Enterprise, many sources, compliance DWH + Data Lake hybrid
Real-time + batch analytics Lambda/Kappa with streaming layer
Many autonomous data teams Data Mesh with self-serve platform

Gotchas

  • Star schema denormalization trades storage for query performance - acceptable for OLAP
  • Data Vault adds many joins - mart layer must pre-join for acceptable query performance
  • OLTP databases should never be queried directly for analytics - always replicate to DWH
  • Batch window must complete before business hours - monitor and alert on SLA breaches
  • Data Lake without governance becomes "data swamp"

See Also