Database Selection and Design¶

Choosing the right database is an architectural decision with long-term consequences. Consider data type, consistency requirements, performance needs, team skills, and budget.

OLTP vs OLAP¶

Aspect	OLTP	OLAP
Purpose	Real-time transactions	Analytical queries, reporting
Transactions	Many short	Few long-running
Examples	PostgreSQL, MySQL, Oracle	ClickHouse, BigQuery, Greenplum
Use case	Service operational data	DWH, analytics, reports

Selection by Data Type¶

Data Type	DB Type	Example	Best For
Structured	Relational	PostgreSQL	Orders, users, transactions
Documents/JSON	Document	MongoDB	Hierarchical, semi-structured
High-volume fast-changing	Key-Value	Redis	Sessions, caching
Analytics/reports	Columnar	ClickHouse, Cassandra	TB/PB aggregations
Binary/large files	Object storage	S3, MinIO	Images, video, audio
Relationships	Graph	Neo4j	Social networks, recommendations

Decision Tree¶

Need ACID transactions?
  Yes -> PostgreSQL (or MySQL)
Need flexible schema?
  Yes -> MongoDB
Need high write throughput (>500K TPS)?
  Yes -> Cassandra / ScyllaDB
Need sub-ms latency for simple keys?
  Yes -> Redis
Need full-text search?
  Yes -> Elasticsearch
Need relationship traversal?
  Yes -> Neo4j
Need analytical aggregations?
  Yes -> ClickHouse / BigQuery

Database Deep Dives¶

PostgreSQL¶

High write throughput, fast reads, JSON/XML support. Requires careful schema design. - Use for: order processing, ACID-critical data, OLTP workloads - OLAP extensions: GreenPlum, ClickHouse for analytical layer

Redis¶

In-memory, key-value. Extremely fast reads/writes. Built-in TTL. - Use for: sessions, caching, rate limiting, leaderboards - Caveat: resource-intensive for large datasets (everything in RAM)

MongoDB / CouchDB¶

Flexible schema (JSON documents). Fast development. - Use for: hierarchical data, rapid prototyping, schema evolution - Caveat: not ideal for complex transactions

Cassandra¶

Column families, TB/PB scale. Excellent horizontal scalability. - Use for: analytics, time-series, IoT data - Caveat: not for frequent writes or complex transactions

S3 / MinIO¶

Object storage. Scalable, granular access control. - Use for: unstructured files (images, video, documents) - Caveat: can be expensive at scale

Database Design Fundamentals¶

Normalization¶

1NF - atomic values, no repeating groups
2NF - all non-key attributes depend on entire primary key
3NF - no transitive dependencies

Keys¶

Primary Key - unique row identifier
Foreign Key - reference to PK in another table
Composite Key - multiple columns as PK
Surrogate - auto-generated (no business meaning)
Natural - business-meaningful (email, SSN)

Indexes¶

Type	Best For
B-Tree	Range queries, equality (default)
Hash	Fast equality lookups
GIN	Full-text search, JSONB
Partial	Subset of rows (WHERE clause)

ACID Transactions¶

Atomicity - all or nothing
Consistency - valid state to valid state
Isolation - concurrent transactions don't interfere
Durability - committed data survives failures

Isolation Levels¶

Level	Dirty Reads	Non-Repeatable	Phantom
Read Uncommitted	Yes	Yes	Yes
Read Committed	No	Yes	Yes
Repeatable Read	No	No	Yes
Serializable	No	No	No

PostgreSQL default: Read Committed.

ORM Considerations¶

When to use: Rapid development, standard CRUD, reducing boilerplate. One migration cut 1500 classes of manual SQL to 5x less code.

When NOT to use: Reports (raw SQL), high-load critical paths, complex analytical queries.

Hybrid approach: ORM for business logic, raw SQL for reports. Migration tools: Flyway, Liquibase (Java), Entity Framework migrations (.NET).

Query Optimization¶

EXPLAIN ANALYZE - understand execution plan
Index usage - ensure queries hit indexes
**Avoid SELECT *** - retrieve only needed columns
JOIN optimization - proper order, indexes on join columns
Connection pooling - PgBouncer for PostgreSQL
Partitioning - split large tables (range/list/hash)
Materialized views - pre-computed for complex reports

Scaling Patterns¶

Read replicas - distribute read load
Sharding - split data across DBs by key
Partitioning - split tables within single DB
Hot/cold storage - recent data on SSD, old data on cheap storage

Gotchas¶

Shared database between microservices is an anti-pattern - tight coupling, schema coordination problems
Premature NoSQL - start with PostgreSQL unless you have clear NoSQL requirements
Redis as primary store without persistence config = data loss on restart
N+1 query problem - ORM lazy loading causes N+1 queries in loops. Use eager loading or batch fetching
Schema migration without rollback plan - every migration needs corresponding rollback script
Confusing sharding with replication - sharding splits data across machines; replication copies same data to multiple machines