Apache Spark Core Architecture¶

Apache Spark is an open-source distributed data processing framework. Key advantages over MapReduce: in-memory computation, unified API for batch/streaming/ML/graph, support for Python/Scala/Java/R.

Cluster Components¶

Component	Role
Cluster Manager	Allocates resources. Options: Standalone, YARN, Kubernetes, Mesos
Driver	Master process - converts app into tasks, schedules, tracks progress
SparkContext / SparkSession	Entry point for interacting with the cluster
Worker Node	Server/VM running Executor processes
Executor	Process on Worker that executes Tasks
Task	Minimum unit of work - one operation on one Partition
Partition	Minimum logical data unit. Default shuffle partitions: 200

Deploy Modes¶

Client mode: Driver on submitting machine; stopping client kills all executors
Cluster mode: Driver and executors both on cluster; client only for submission
Local mode: All on one JVM. --master local[*] uses all cores. For dev/testing

Execution Hierarchy¶

Application -> Job(s) -> Stage(s) -> Task(s)

Job: created by an Action call
Stage: set of Tasks without Shuffle boundary
Task: one operation on one Partition
Shuffle: data redistribution across nodes. Triggered by join, groupBy, distinct, repartition

Spark Stack¶

Module	Purpose
Spark Core	Task scheduling, memory management, fault tolerance
Spark SQL + DataFrames	Structured data via SQL or DataFrame API
Spark Streaming	Real-time micro-batch processing
MLlib	Distributed ML
GraphX	Graph processing

Data Structures¶

RDD (Resilient Distributed Dataset)¶

Lowest-level API. Immutable, lazy evaluation, fault-tolerant via DAG lineage. Use for unstructured data or media files.

rdd = sc.parallelize([1, 2, 3, 4, 5])

DataFrame¶

Distributed named-column collection. Primary PySpark object. Catalyst optimizer for query planning.

Dataset¶

Typed DataFrame (Scala/Java only). Not available in Python.

Lazy Evaluation¶

Spark builds a logical plan optimized by Catalyst Optimizer (predicate pushdown, constant folding, filter reordering). Execution starts only when an Action is called.

Transformations (lazy): - Narrow (no shuffle): map, filter, select, withColumn - Wide (trigger shuffle): join, groupBy, distinct, repartition, orderBy

Actions (trigger execution):

df.show()              # display rows
df.count()             # count rows
df.collect()           # all to driver (OOM risk!)
df.toPandas()          # convert to Pandas (all to driver)
df.write.parquet(path) # write to file

Catalyst Optimizer Pipeline¶

Logical Planning: parse -> analyze -> optimize (predicate pushdown, constant folding)
Physical Planning: generate multiple plans -> cost-based selection
Tungsten: whole-stage code generation for CPU efficiency

spark-submit¶

spark-submit \
  --master spark://master:7077 \
  --deploy-mode cluster \
  --executor-memory 4g \
  --executor-cores 4 \
  --num-executors 10 \
  app.py [args]

PySpark vs Pandas¶

Feature	PySpark	Pandas
Execution	Distributed	Single machine
Evaluation	Lazy	Eager
Mutability	Immutable	Mutable
Scale	TB+	Limited by RAM

Gotchas¶

collect() and toPandas() bring all data to driver - OOM risk on large datasets
UDFs bypass Catalyst optimizer and cause Python-JVM serialization overhead
inferSchema=True reads entire CSV to determine types - slow on large files
Spark Streaming Structured API does not allow terminal actions mid-pipeline
Spark on Kubernetes production-ready since Spark 3.1 (March 2021)