Big Data and ML Architecture

Architecture for processing large data volumes and integrating ML models into production. The architect chooses appropriate processing platforms, designs data pipelines, and ensures data governance.

Data Processing Paradigms

Paradigm | Processing | Technologies | Best For
Batch | Scheduled intervals, large volumes | Hadoop, Spark, Hive | ETL, reports, historical analysis
Stream | Continuous, real-time | Kafka Streams, Flink, Storm | Fraud detection, real-time recommendations

Lambda Architecture

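Combines batch and stream processing. The batch layer recomputes accurate views over the full dataset, the speed layer covers recent data in real time, and the serving layer merges both at query time. Drawback: the same logic must be maintained in two codebases.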
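A minimal sketch of the three layers, assuming a simple per-user event count as the view being served. The function names (`batch_view`, `speed_view`, `serve`) and the `cutoff` timestamp are illustrative, not part of any framework. Note that the counting logic appears twice, which is the two-codebase drawback in miniature.

```python
from collections import Counter

def batch_view(events, cutoff):
    """Batch layer: accurate counts over all events up to the last batch run."""
    return Counter(e["user"] for e in events if e["ts"] <= cutoff)

def speed_view(events, cutoff):
    """Speed layer: incremental counts for events the batch run has not seen yet."""
    return Counter(e["user"] for e in events if e["ts"] > cutoff)

def serve(batch, speed, user):
    """Serving layer: merge both views at query time."""
    return batch.get(user, 0) + speed.get(user, 0)

events = [
    {"user": "a", "ts": 1},
    {"user": "b", "ts": 2},
    {"user": "a", "ts": 5},  # arrived after the last batch run
]
cutoff = 3  # timestamp of the last completed batch run (assumed)

batch = batch_view(events, cutoff)
speed = speed_view(events, cutoff)
total_a = serve(batch, speed, "a")
```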

Kappa Architecture

Uses stream processing for everything. Reprocessing is done by replaying the stream from the log through the same code. Simpler than Lambda, but not suited to every workload.
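A minimal sketch of reprocessing by replay, assuming the stream is kept as an ordered, append-only log (a plain list here stands in for a Kafka topic). `apply_event` and `replay` are hypothetical names; the point is that one function handles both live events and full rebuilds.

```python
def apply_event(state, event):
    """Single processing function, used for live traffic and for replays alike."""
    new_state = dict(state)
    account = event["account"]
    new_state[account] = new_state.get(account, 0) + event["amount"]
    return new_state

def replay(log, from_offset=0):
    """Rebuild state from scratch by re-reading the log from a given offset."""
    state = {}
    for event in log[from_offset:]:
        state = apply_event(state, event)
    return state

log = [
    {"account": "acc-1", "amount": 100},
    {"account": "acc-2", "amount": 50},
    {"account": "acc-1", "amount": -30},
]

balances = replay(log)
```

If the processing logic changes, the new version of `apply_event` is deployed and the log is replayed; there is no separate batch job to keep in sync.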

Data Storage Solutions

Solution | Approach | Technologies | Risk
Data Lake | Raw data, schema-on-read | HDFS, S3, Azure Data Lake | "Data swamp" without governance
Data Warehouse | Processed, schema-on-write | Snowflake, BigQuery, Redshift, ClickHouse | Rigid schema changes
Data Lakehouse | Lake flexibility + warehouse structure | Delta Lake, Apache Iceberg, Hudi | Newer, less mature

Data Modeling Approaches

Model | Loading | Change Tracking | Best For
Inmon (3NF) | Complex (dependency order) | Poor | Already normalized sources
Kimball (Star) | Simpler | SCD Type 2 adds complexity | Simple analytics
Data Vault | Simplest (independent loads) | Native (satellites + timestamps) | Evolving, changing data
Anchor Model | Most flexible | Most flexible | Extreme flexibility needs
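To make the "SCD Type 2 adds complexity" entry concrete, here is a minimal sketch of a Type 2 update on an in-memory dimension. The row layout (`key`, `attrs`, `valid_from`, `valid_to`) and the function name `scd2_upsert` are assumptions for illustration; a real warehouse would do this in SQL with surrogate keys.

```python
from datetime import date

def scd2_upsert(dimension, key, attrs, as_of):
    """Close the current row for `key` if its attributes changed, then append a new version."""
    current = next(
        (r for r in dimension if r["key"] == key and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["attrs"] == attrs:
            return dimension  # nothing changed, no new version needed
        current["valid_to"] = as_of  # close the old version
    dimension.append(
        {"key": key, "attrs": attrs, "valid_from": as_of, "valid_to": None}
    )
    return dimension

dim = []
scd2_upsert(dim, "cust-1", {"city": "Berlin"}, date(2024, 1, 1))
scd2_upsert(dim, "cust-1", {"city": "Berlin"}, date(2024, 2, 1))  # no change
scd2_upsert(dim, "cust-1", {"city": "Munich"}, date(2024, 3, 1))  # new version
```

Every load has to find the current row, compare it, close it, and insert a new one, which is the extra work the table refers to.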

ETL vs ELT

Approach | Process | Best For
ETL | Transform before loading | Traditional DWH
ELT | Load raw, transform in target | Modern cloud DWH (leverage scalable compute)
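A minimal ELT sketch, using an in-memory SQLite database as a stand-in for the target warehouse. The table names (`raw_orders`, `paid_orders`) and sample rows are made up for illustration. Raw data is loaded untouched, and the transformation (type cast and filter) runs inside the target using its own SQL engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: raw rows go in as-is, amounts still stored as text.
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "10.50", "paid"), (2, "3.00", "refunded"), (3, "7.25", "paid")],
)

# Transform: performed in the target, after loading.
conn.execute(
    """
    CREATE TABLE paid_orders AS
    SELECT id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE status = 'paid'
    """
)

total_paid = conn.execute("SELECT SUM(amount) FROM paid_orders").fetchone()[0]
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

In an ETL variant the cast and filter would run in the pipeline process before the insert, and the raw rows would never reach the warehouse.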

ML Pipeline

Data collection -> Preparation/cleaning -> Feature engineering ->
  Model training -> Evaluation -> Deployment -> Monitoring/retraining

Model Serving Patterns

Pattern | Description | Use Case
Batch prediction | Pre-compute predictions | Recommendations, risk scores
Real-time inference | Online prediction via API | Search, fraud detection
Edge inference | On-device | Mobile, IoT
Embedded model | Model within app code | Lightweight predictions
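The difference between the first two patterns can be sketched as follows. The `model` function is a stand-in with made-up weights, and `lookup_score` / `predict_online` are hypothetical names; in practice the online path would sit behind an HTTP or gRPC endpoint.

```python
def model(features):
    """Stand-in for a trained model: a fixed linear score (hypothetical weights)."""
    return 0.5 * features["visits"] + 2.0 * features["purchases"]

# Batch prediction: score every known user ahead of time and store the results.
users = {
    "u1": {"visits": 4, "purchases": 1},
    "u2": {"visits": 10, "purchases": 0},
}
precomputed = {user_id: model(f) for user_id, f in users.items()}

def lookup_score(user_id):
    """Serving a batch prediction is a key lookup; no model call at request time."""
    return precomputed[user_id]

def predict_online(features):
    """Real-time inference: the model runs per request on fresh features."""
    return model(features)
```

Batch prediction trades freshness for cheap, predictable reads; real-time inference handles inputs that are only known at request time, at the cost of model latency on every call.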

MLOps

Applying DevOps practices to ML: version control for data and models, automated training pipelines, A/B testing of deployed models, and monitoring for model drift.
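One common way to monitor drift is to compare the distribution of a feature at training time with its distribution in production. The sketch below uses the Population Stability Index (PSI) as the comparison; this is one possible choice, not something the text prescribes. The bin edges and sample values are made up, and the 0.2 alert threshold is a widely used rule of thumb rather than a fixed standard.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over fixed bin edges."""
    def shares(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # index of the bin that contains v
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [1, 2, 3, 4, 5, 6, 7, 8]       # feature values seen at training time
live = [5, 6, 7, 8, 9, 10, 11, 12]     # feature values seen in production
edges = [3, 6, 9]

drift_score = psi(train, live, edges)
needs_retraining = drift_score > 0.2   # assumed alert threshold
```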

Feature Store: a centralized repository of ML features that keeps feature values consistent between training and serving. Products: Feast, Tecton, Vertex AI Feature Store.
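The consistency guarantee comes from computing each feature in exactly one place. The toy class below illustrates the idea only; its API (`register`, `ingest`, `get_online`) is hypothetical and does not mirror Feast, Tecton, or Vertex AI.

```python
class FeatureStore:
    """Toy in-memory feature store (hypothetical API, for illustration only)."""

    def __init__(self):
        self._definitions = {}  # feature name -> function of a raw record
        self._online = {}       # entity id -> latest computed feature values

    def register(self, name, fn):
        self._definitions[name] = fn

    def ingest(self, entity_id, raw):
        """Compute every registered feature once and store it for serving."""
        row = {name: fn(raw) for name, fn in self._definitions.items()}
        self._online[entity_id] = row
        return row  # the same row can be appended to the training set

    def get_online(self, entity_id):
        return self._online[entity_id]

store = FeatureStore()
store.register("order_count", lambda raw: len(raw["orders"]))
store.register("avg_order", lambda raw: sum(raw["orders"]) / len(raw["orders"]))

training_row = store.ingest("cust-1", {"orders": [20.0, 40.0]})
serving_row = store.get_online("cust-1")
```

Because training and serving read values produced by the same registered functions, there is no second implementation of `avg_order` that could drift from the first.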

Data Pipeline Architecture Example

[Source 1C] --> [Kafka topic] --> [Airflow orchestrator] --> [DWH (PostgreSQL)]
[Source CRM] --> [Kafka topic] -->                               |
[External API] --> [API Fetcher] -->                             |
                                                          [Data Marts]
                                                               |
                                                          [BI Dashboard]

Key decisions:

  • Internal systems push changes to Kafka (source knows when data changes)
  • External API: pull (our system fetches periodically)
  • Airflow for ETL orchestration (monitoring, retries, error handling)
  • Custom PHP/Python scripts strongly discouraged - use proper orchestration
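To show what "retries" means in practice, here is a plain-Python sketch of retrying a task with exponential backoff. It is not Airflow's API; `run_with_retries` and its parameters are hypothetical. An orchestrator provides this behavior, plus monitoring and alerting, through configuration, which is one reason hand-written scripts are discouraged.

```python
import time

def run_with_retries(task, retries=3, base_delay=1.0, sleep=time.sleep):
    """Run `task`, retrying on failure with exponentially growing delays."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            attempt += 1
            if attempt > retries:
                raise  # retries exhausted: surface the error
            sleep(base_delay * 2 ** (attempt - 1))

# Example: a task that fails twice before succeeding.
calls = {"n": 0}

def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "loaded"

delays = []  # record delays instead of actually sleeping
result = run_with_retries(flaky_load, retries=3, base_delay=1.0, sleep=delays.append)
```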

Technology Stack

Category | Technologies
Batch processing | Hadoop, Spark, Hive
Stream processing | Kafka Streams, Flink, Spark Streaming
Storage | HDFS, S3, Delta Lake
Data warehouse | Snowflake, BigQuery, Redshift, ClickHouse
Orchestration | Airflow, Prefect, Dagster
ML platforms | MLflow, Kubeflow, SageMaker, Vertex AI
Data quality | Great Expectations, dbt tests

Architecture Considerations

  • Volume - TB vs PB determines storage and processing choices
  • Velocity - real-time requirements drive the choice between streaming and batch
  • Variety - structured/semi-structured/unstructured affects storage
  • Data governance - privacy (GDPR), access control, audit trails, lineage
  • Cost - right-sizing compute, storage tiering, spot instances
  • Data quality - garbage in = garbage out. Validate at pipeline entry
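Validation at pipeline entry can be sketched as a function that checks each incoming record and routes failures aside instead of loading them. The field names, the rules, and the allowed-currency set are assumptions for illustration; tools such as Great Expectations or dbt tests express similar checks declaratively.

```python
ALLOWED_CURRENCIES = {"EUR", "USD"}  # hypothetical business rule

def validate(record):
    """Return a list of problems found in one record (empty list means valid)."""
    errors = []
    if not isinstance(record.get("order_id"), int):
        errors.append("order_id must be an integer")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    if record.get("currency") not in ALLOWED_CURRENCIES:
        errors.append("currency not allowed")
    return errors

def split_valid(records):
    """Separate records into accepted ones and rejected ones with their errors."""
    accepted, rejected = [], []
    for record in records:
        errors = validate(record)
        if errors:
            rejected.append((record, errors))
        else:
            accepted.append(record)
    return accepted, rejected

records = [
    {"order_id": 1, "amount": 19.99, "currency": "EUR"},
    {"order_id": "2", "amount": -5, "currency": "GBP"},
]
accepted, rejected = split_valid(records)
```

Keeping rejected records together with their error messages makes it possible to report data quality problems back to the source system.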

Gotchas

  • Data lake without governance becomes "data swamp" - nobody can find or trust data
  • Real-time is not always needed - a daily batch may suffice for a director checking morning reports
  • Source data changes retroactively - sales data can change for 1+ year (refunds, recalculations). Design for immutable history (Data Vault)
  • Pulling from a source DB is risky - you don't learn about all data changes, and schema changes break scripts. Prefer a push model
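The retroactive-change gotcha can be illustrated with an insert-only history, loosely in the spirit of a Data Vault satellite. This is a simplified sketch: the row layout and the names `record_version` and `amount_as_of` are assumptions, and load timestamps are ISO date strings so that plain string comparison orders them correctly.

```python
def record_version(history, sale_id, amount, load_ts):
    """Append a new version; existing rows are never updated or deleted."""
    history.append({"sale_id": sale_id, "amount": amount, "load_ts": load_ts})

def amount_as_of(history, sale_id, ts):
    """Return the latest amount for `sale_id` known at time `ts`, or None."""
    versions = [
        r for r in history
        if r["sale_id"] == sale_id and r["load_ts"] <= ts
    ]
    if not versions:
        return None
    return max(versions, key=lambda r: r["load_ts"])["amount"]

history = []
record_version(history, "s-1", 100.0, "2023-01-10")
record_version(history, "s-1", 60.0, "2024-02-01")  # partial refund a year later

reported_in_2023 = amount_as_of(history, "s-1", "2023-06-30")
current_value = amount_as_of(history, "s-1", "2024-03-01")
```

Because the original row is kept, a report generated in 2023 can still be reproduced after the refund arrives.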

See Also