Cloud Data Platforms¶

Cloud platforms separate compute from storage, enabling elastic scaling and pay-per-use pricing. The "Big Three" (AWS, GCP, Azure) plus Snowflake provide the foundation for modern data engineering.

Service Model Comparison¶

On-Premise	AWS	GCP	Purpose
HDFS	S3	Cloud Storage	Object storage
Spark	EMR	Dataproc	Distributed compute
Kafka	MSK / Kinesis	Pub/Sub	Event streaming
PostgreSQL	RDS	Cloud SQL	Managed RDBMS

Cloud DWH Options¶

System	Type	Key Feature
Athena	Query service	Managed Presto, queries S3 without loading
Redshift	MPP DWH	Closest to Greenplum architecture
BigQuery	Serverless DWH	No compute control - true serverless
Snowflake	Separated compute/storage	Spins up clusters per query, data in S3
Presto/Trino	Federated query engine	Queries data in source systems without ETL

S3 - Simple Storage Service¶

5-10x cheaper than HDFS
99.999999999% durability (11 nines)
Practically unlimited elasticity
Separates storage from compute
Access via boto3 (Python), CLI, or Spark (s3a://)

Cloud Service Models for DE¶

Approach	Examples	DE Responsibility
IaaS	Self-managed ClickHouse on VMs	OS, database, monitoring, backups
PaaS	Managed ClickHouse, RDS	Service configuration
SaaS	BigQuery, Snowflake	Query writing only

Snowflake Architecture¶

Three layers: Storage, Compute, Services
No direct access to storage layer
Automatic micro-partitioning
Virtual clusters (warehouses) for different workloads
Pay per query compute time

VM and Disk Best Practices¶

Disk performance scales with size (QoS). A 50GB SSD may be slow because IOPS are throttled proportionally. Fix: increase disk size for more IOPS.

Disk Type	Best For
HDD	Cheapest, archives
SSD	Dev/test (cost-effective)
High-IOPS SSD	Production workloads
Local NVMe	Maximum performance, no replication

Cost Management¶

Always pause/stop unused VMs, databases, K8s clusters
Transient clusters for one-off jobs - create, use, destroy
HDD for dev/test, SSD for staging, High-IOPS for production
Monitor QoS limits - small disks have lower IOPS quotas

Cloud-Native Data Pattern¶

Storage Layer: S3 (or compatible)
Compute Layer: Kubernetes (Spark, Presto run here)

Separation enables elastic compute independent of storage.

Gotchas¶

Redshift nodes include compute + storage by default; Snowflake separates natively
Vertical scaling requires VM restart (downtime)
Always configure firewall - any VM with public IP is a target
Network latency kills cross-datacenter cluster performance with synchronous replication
BigQuery: no compute control means unexpected costs on complex queries