Skip to content

Apache Spark and Big Data Processing

When data exceeds single-machine memory, Spark distributes computation across clusters. Understanding Spark's execution model is critical for performance.

Architecture

  • SparkContext: entry point for functionality
  • Driver: orchestrates execution
  • Worker Node -> Executor -> Task: execution hierarchy
  • Partition: unit of data parallelism
  • Job -> Stage -> Task: computation hierarchy

Core Optimization Rules

  • 2-3 tasks per CPU core for optimal parallelism
  • ~128 MB per partition recommended
  • Monitor data skew via Spark UI
  • Push filters as early as possible (predicate pushdown)
  • Avoid Python UDFs (break Catalyst optimizer) - use built-in functions

Shuffle

Data redistribution across nodes. Most expensive operation.

Triggers: groupBy, join, distinct, repartition, orderBy

Minimizing shuffle: - Filter before joining - Broadcast small tables - Use coalesce (no shuffle) instead of repartition when reducing partitions

Key Techniques

Broadcast Join

Send small table to all executors, avoid shuffle.

from pyspark.sql.functions import broadcast
result = big_df.join(broadcast(small_df), 'key')

Use when one side is small (< 10MB default, configurable).

Caching

df.cache()     # in memory
df.persist()   # configurable storage level
df.unpersist() # free memory

Cache DataFrames that are reused multiple times. Don't cache everything - memory is limited.

Repartition vs Coalesce

df.repartition(100)     # full shuffle, any number of partitions
df.repartition('key')   # partition by column (useful before groupBy)
df.coalesce(10)         # reduce partitions without shuffle

Data Skew

Problem: one partition much larger than others -> one task takes disproportionately long.

Detection: Spark UI shows task durations. If max >> median, you have skew.

Solutions: - Salting: add random prefix to skewed key, join with exploded dimension - Broadcast join: for skewed side if small enough - Repartition by different key: redistribute data

Small File Problem

Too many small files = too many tasks = scheduling overhead.

Fix: compact files via coalesce() or repartition before writing.

df.coalesce(10).write.parquet('output/')

Executor Sizing

  • Memory: too small = spill to disk (slow), too large = GC pauses
  • Cores: 5 cores per executor is a good default
  • Balance: more executors with moderate resources vs fewer with more

Gotchas

  • Spark is lazy - transformations don't execute until an action (collect, count, write)
  • collect() brings all data to driver - OOM for large datasets. Use take() or show()
  • Python UDFs are 10-100x slower than built-in Spark functions due to serialization overhead
  • Joins on skewed keys can run forever without intervention
  • Always check partition count after reading: df.rdd.getNumPartitions()

See Also