Data Engineering¶
Knowledge base covering ETL/ELT, data pipelines, data warehousing, distributed computing, and the modern data stack.
Concepts and Architecture¶
- [[etl-elt-pipelines]] - ETL vs ELT, pipeline design, processing modes, idempotency
- [[dwh-architecture]] - OLTP vs OLAP, DWH layers, Kimball vs Inmon, platform evolution
- [[data-modeling]] - normalization (1NF-3NF), ER diagrams, keys, deduplication patterns
- [[dimensional-modeling]] - star/snowflake schema, fact/dimension tables, Kimball design
- [[data-vault]] - Hub/Link/Satellite, Data Vault 2.0, anchor modeling
- [[scd-patterns]] - slowly changing dimensions, SCD2 merge logic
- [[data-lake-lakehouse]] - data lake, lakehouse, Delta Lake, Iceberg, Hudi
- [[data-quality]] - quality dimensions, observability, monitoring, alerting
- [[data-governance-catalog]] - DAMA DMBOK, data catalog, GDPR compliance
- [[data-lineage-metadata]] - lineage types, metadata categories, Prometheus+Grafana
- [[file-formats]] - Parquet, ORC, Avro, CSV comparison
Distributed Processing¶
- [[apache-spark-core]] - Spark architecture, execution model, Catalyst optimizer
- [[pyspark-dataframe-api]] - DataFrame operations, schemas, I/O, Spark SQL
- [[spark-optimization]] - partitioning, skew handling, broadcast joins, AQE
- [[spark-streaming]] - Structured Streaming, micro-batch, DStreams
- [[apache-kafka]] - event streaming, PubSub, topics, consumer groups
- [[mapreduce]] - Map/Reduce paradigm, shuffle, Hadoop Streaming
Storage and Databases¶
- [[hadoop-hdfs]] - HDFS architecture, blocks, replication, small files problem
- [[apache-hive]] - SQL-on-Hadoop, Metastore, join strategies (MapJoin, SMB)
- [[hbase]] - columnar NoSQL, row key, column families, versioning
- [[clickhouse]] - columnar OLAP, partitions, granules, primary key, functions
- [[clickhouse-engines]] - MergeTree family, compression, skip indexes
- [[greenplum-mpp]] - MPP architecture, distribution, motion operators
- [[postgresql-administration]] - transactions, MVCC, PL/pgSQL, query optimization
- [[mongodb-nosql]] - document store, CAP theorem, aggregation pipelines
Infrastructure and Tools¶
- [[apache-airflow]] - DAG orchestration, operators, TaskFlow API, XCom
- [[cloud-data-platforms]] - AWS/GCP/Azure, Snowflake, BigQuery, S3
- [[docker-for-de]] - containers, Dockerfile, docker-compose
- [[kubernetes-for-de]] - K8s architecture, Spark on K8s, Helm
- [[yarn-resource-management]] - YARN vs JobTracker, queues, schedulers
Cross-Cutting¶
- [[mlops-feature-store]] - MLflow, feature stores, model serving, CRISP-DM
- [[sql-for-de]] - window functions, CTEs, recursive queries, optimization
- [[python-for-de]] - database access, Pandas, functional programming, testing
Cross-Topic Links¶
- [[sql-databases/index]] - deep SQL reference
- [[python/index]] - Python language fundamentals
- [[devops/index]] - CI/CD, infrastructure as code
- [[architecture/index]] - system design patterns
- [[data-science/index]] - ML and analytics
- [[bi-analytics/index]] - BI tools and dashboards