MapReduce¶

MapReduce is a programming model for distributed data processing. The user supplies two operations: Map (transform or filter each record) and Reduce (aggregate values by key). The framework guarantees that all records with the same key arrive at the same Reducer.

Paradigm¶

Map:    (K1, V1) -> list(K2, V2)    -- transform each record
Reduce: (K2, list(V2)) -> list(K3, V3)  -- aggregate by key
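
The two signatures can be illustrated with a single-process word count. This is a minimal sketch, not how Hadoop executes a job: `map_words`, `reduce_counts`, and `run_local` are hypothetical names, and the in-memory dictionary stands in for the framework's grouping by key.

```python
from collections import defaultdict

def map_words(offset, line):
    """Map: (K1, V1) -> list(K2, V2). Emits (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_counts(word, counts):
    """Reduce: (K2, list(V2)) -> list(K3, V3). Sums the counts for one word."""
    yield word, sum(counts)

def run_local(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each key."""
    grouped = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            grouped[k2].append(v2)
    output = []
    for k2 in sorted(grouped):  # reducers see keys in sorted order
        output.extend(reducer(k2, grouped[k2]))
    return output

# K1 is the byte offset of the line, V1 is the line text (as with TextInputFormat).
records = [(0, "to be or not"), (13, "to be")]
word_counts = run_local(records, map_words, reduce_counts)
print(word_counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```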

Execution Pipeline¶

Input -> Split -> Map -> Spill -> Sort/Merge -> Shuffle -> Reduce -> Output

Split: InputFormat splits data into chunks. Each split -> one Mapper
Map: Each Mapper reads its split through a RecordReader and writes output to an in-memory circular buffer
Spill: When the buffer fills past a threshold, its contents are partitioned by hash(key) % numReducers, sorted, and spilled to local disk (NOT HDFS)
Sort/Merge: The Mapper merge-sorts its spill files into one sorted output, grouped by Reducer partition
Shuffle: Each Reducer pulls (fetches) its own partition from every Mapper
Reduce: Merge-sort the fetched files, then process the sorted key-value stream one key at a time
Output: One output file per Reducer on HDFS
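
The partition, sort, shuffle, and merge steps above can be simulated in memory. This is a rough sketch under stated assumptions: `partition`, `map_side`, and `reduce_side` are hypothetical helpers, lists stand in for spill files, and `zlib.crc32` replaces the real key hash only so that results are stable across Python runs.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    """Stand-in for hash(key) % numReducers (crc32 is an assumption, not Hadoop's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_side(pairs, num_reducers):
    """Partition one mapper's output and sort each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]

def reduce_side(reducer_id, mapper_outputs):
    """Pull this reducer's partition from every mapper, merge-sort, and sum per key."""
    fetched = (output[reducer_id] for output in mapper_outputs)
    merged = heapq.merge(*fetched, key=itemgetter(0))
    return [(key, sum(value for _, value in group))
            for key, group in groupby(merged, key=itemgetter(0))]

num_reducers = 2
mapper_outputs = [
    map_side([("a", 1), ("b", 1), ("a", 1)], num_reducers),
    map_side([("b", 1), ("c", 1)], num_reducers),
]
results = [reduce_side(r, mapper_outputs) for r in range(num_reducers)]
for reducer_id, rows in enumerate(results):
    print(reducer_id, rows)
```

Because the partition depends only on the key, every occurrence of a key lands in the same reducer's list regardless of which mapper emitted it.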

Hadoop Streaming (Python)¶

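# -files is a generic option (must precede the streaming options); it replaces the deprecated -file
# and ships the scripts to every node. Scripts need a shebang line and the executable bit.
hadoop jar $HADOOP_MAPRED_HOME/hadoop-streaming.jar \
  -files /local/path/mapper.py,/local/path/reducer.py \
  -mapper mapper.py \
  -reducer reducer.py \
  -input /hdfs/input/ \
  -output /hdfs/output/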
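
What mapper.py and reducer.py might contain for a word count is sketched below. The function names `map_lines` and `reduce_lines` are hypothetical. The sketch assumes the default Streaming convention of one tab-separated key and value per line, and relies on the reducer receiving its input sorted by key.

```python
from itertools import groupby

def map_lines(lines):
    """mapper.py logic: emit 'word<TAB>1' for each word on each input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(lines):
    """reducer.py logic: input is sorted by key, so equal keys are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run, roughly equivalent to: cat input | mapper.py | sort | reducer.py
mapped = sorted(map_lines(["b a", "a c a"]))
for out_line in reduce_lines(mapped):
    print(out_line)
```

In the real scripts each function would be driven by `sys.stdin` and its results written to `sys.stdout`, with each script in its own file.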

Key Details¶

Split != HDFS block: Splits are logical (can be from DB, S3); blocks are physical
InputFormat types: TextInputFormat (default), NLineInputFormat (N lines per mapper), DBInputFormat
Number of Mappers: equals the number of splits produced by the InputFormat; not set directly, only influenced through split-size settings
Number of Reducers: configurable via mapreduce.job.reduces
_SUCCESS file: empty marker written when the job completes successfully. Files whose names start with _ or . are ignored as input by subsequent jobs
Output directory must NOT exist before job launch (safety against overwriting)
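
How the number of Mappers falls out of the split logic can be sketched for file-based inputs. The following is a simplified model of the FileInputFormat calculation for one non-empty, splittable file. `compute_split_size` and `count_splits` are illustrative names; the `min_size` and `max_size` parameters are assumed to correspond to `mapreduce.input.fileinputformat.split.minsize` and `mapreduce.input.fileinputformat.split.maxsize`.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size

def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size is the block size clamped to [min_size, max_size]."""
    return max(min_size, min(max_size, block_size))

def count_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Number of splits (and therefore Mappers) for one non-empty splittable file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the tail becomes the final split
    return splits

print(count_splits(300 * MB, 128 * MB))                    # 3
print(count_splits(130 * MB, 128 * MB))                    # 1 (tail within the 10% slop)
print(count_splits(300 * MB, 128 * MB, max_size=64 * MB))  # 5
```

Lowering the maximum split size increases the Mapper count, and raising the minimum decreases it, which is the indirect control that the split logic offers.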

MapReduce v1 vs v2¶

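Push model: Mappers push data to Reducers -> many Mappers can overwhelm a single Reducer (DDoS-like load)
Pull model: Reducers fetch from Mappers at their own pace. Hadoop's shuffle is pull-based in both MRv1 and MRv2
v1 vs v2 in Hadoop: v1 schedules tasks with JobTracker/TaskTrackers; v2 runs on YARN (ResourceManager + per-job ApplicationMaster)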

Gotchas¶

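Container initialization overhead dominates for small datasets: a 100KB input took 40s with 31 mappers vs 9s with 3
Split is NOT the same as HDFS block: a split is a logical unit of work for one Mapper, a block is a physical unit of storage
Hadoop Streaming (Python) lacks some Java MR features: there are no explicit setup/cleanup hooks, so one-time work goes before or after the stdin loop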
MapReduce is legacy - prefer Spark for new workloads
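
For the Streaming gotcha above, one common workaround is to place one-time work before and after the loop over input lines. The sketch below assumes a word-count mapper that pre-aggregates counts in memory; `run_mapper` and the stop-word set are hypothetical.

```python
import io
from collections import Counter

def run_mapper(lines, out):
    # "setup": runs once per mapper process, before the first record
    stop_words = {"the", "a", "of"}  # hypothetical; could be loaded from a shipped file
    counts = Counter()

    # "map": runs once per input record
    for line in lines:
        for word in line.lower().split():
            if word not in stop_words:
                counts[word] += 1

    # "cleanup": runs once, after the input is exhausted
    for word, count in sorted(counts.items()):
        out.write(f"{word}\t{count}\n")

# Local dry run; in a real mapper.py the call would be run_mapper(sys.stdin, sys.stdout).
demo_out = io.StringIO()
run_mapper(["the cat", "a cat of hats"], demo_out)
print(demo_out.getvalue(), end="")
```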