Skip to content

Apache Kafka

Apache Kafka is a distributed event streaming platform. It decouples producers from consumers via a persistent message broker, enabling high-throughput, fault-tolerant data pipelines and real-time analytics.

Why PubSub

Direct client-to-client communication fails when: - Consumer goes offline (producer has nowhere to send) - Multiple consumers need the same events - Producer must know all consumers in advance

Solution: Publishers send to broker, subscribers read from broker.

Alternatives: RabbitMQ, Apache Pulsar, AWS Kinesis, GCP Pub/Sub

Architecture

Component Role
Brokers Servers storing messages
Producers Send messages to topics
Consumers Read messages from topics
Topics Logical separation of message types
Partitions Topics split into ordered partitions; each message gets an offset
Consumer groups Multiple consumers sharing load on a topic

Key Properties

  • High throughput, horizontally scalable
  • Persistent storage (configurable retention: days to months)
  • Messages are ordered within a partition (not across partitions)
  • Consumer groups enable parallel consumption

Use Cases

  • Messaging between services
  • Real-time monitoring and analytics
  • Log aggregation
  • Metric collection
  • Event sourcing

What Kafka is NOT For

  • Long-term storage (use a proper DWH)
  • Building training datasets (move data to storage first)
  • Complex transformations (use Spark/Flink downstream)

Tooling

  • Kafka Tool / Offset Explorer - GUI to browse brokers, topics, partitions
  • Confluent Platform - managed Kafka with additional tools (Schema Registry, ksqlDB)
  • KafkaStreams - lightweight stream processing library

Integration with Spark

# Spark Structured Streaming from Kafka
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host:9092") \
    .option("subscribe", "topic_name") \
    .load()

Gotchas

  • Messages are ordered within a partition only - not across partitions
  • Consumer lag must be monitored (consumers falling behind producers)
  • Partition count cannot be decreased after creation
  • Retention is time-based or size-based - data disappears after retention period

See Also