Topics and Partitions

A topic is Kafka's logical channel for organizing records. Physically, each topic is split into one or more partitions -- ordered, append-only logs of immutable records, stored on broker disks as a sequence of segment files. Partitions are the fundamental unit of parallelism, storage, and replication in Kafka. Every design decision in a Kafka deployment -- throughput, ordering guarantees, consumer scaling, data retention -- traces back to how topics and partitions are configured.

Core Concepts

Topic as Logical Channel

A topic is a named feed of records. Producers write to topics; consumers read from topics. Topics are identified by name and exist across the cluster -- they are not bound to a single broker.

Topics can have any number of producers and consumers simultaneously
Multiple [[consumer-groups]] can read the same topic independently, each tracking its own offsets
Topics are created explicitly (kafka-topics.sh --create) or automatically when a producer first writes to them (if auto.create.topics.enable=true on the broker)
Topic names may contain only [a-zA-Z0-9._-] and are limited to 249 characters. Avoid mixing . and _ across topic names in the same cluster -- Kafka replaces . with _ in metric names, so names such as user.events and user_events collide

Partition as Unit of Parallelism

Each partition is an independent, ordered log. Parallelism in Kafka scales with partition count:

Topic: user-events (4 partitions)

Partition 0:  [msg0] [msg1] [msg2] [msg3] ...
Partition 1:  [msg0] [msg1] [msg2] ...
Partition 2:  [msg0] [msg1] [msg2] [msg3] [msg4] ...
Partition 3:  [msg0] [msg1] ...

Each partition has its own offset sequence starting at 0
A [[consumer-groups|consumer group]] can have at most one consumer per partition -- so partition count is the upper bound on consumer parallelism
Each partition has one leader replica (handles all writes and, by default, all reads) and zero or more follower replicas (passive replication)
Partitions are distributed across brokers in the cluster -- see [[broker-architecture]]
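The parallelism ceiling described above can be illustrated with a toy assignment function. This is a minimal sketch, not one of Kafka's real assignors: `assign_round_robin` and the consumer names are hypothetical, and the only property it models is that each partition has exactly one owner within a group.

```python
def assign_round_robin(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over consumers; each partition gets exactly one owner."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# Two consumers share four partitions evenly
two_consumers = assign_round_robin(4, ["c1", "c2"])
print(two_consumers)  # {'c1': [0, 2], 'c2': [1, 3]}

# Six consumers on four partitions: the extra two receive nothing
six_consumers = assign_round_robin(4, ["c1", "c2", "c3", "c4", "c5", "c6"])
idle = [consumer for consumer, parts in six_consumers.items() if not parts]
print(idle)  # ['c5', 'c6']
```

Adding consumers beyond the partition count only adds idle members; to scale further, the topic needs more partitions.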

Message Ordering Guarantees

Kafka guarantees ordering only within a single partition. There is no global ordering across partitions.

Partition 0:  A -> B -> C        (order guaranteed: A before B before C)
Partition 1:  D -> E -> F        (order guaranteed: D before E before F)

Cross-partition: no guarantee on relative order of A vs D

If you need strict ordering for a set of related records (e.g., all events for a single user), you must route them to the same partition using a consistent key.

With max.in.flight.requests.per.connection > 1 (default: 5), out-of-order delivery is possible even within a partition if retries occur. To guarantee per-partition ordering with retries:

Use an idempotent producer (enable.idempotence=true, default since Kafka 3.0) -- this handles reordering internally
Or set max.in.flight.requests.per.connection=1 (reduces throughput)

See [[producer-patterns]] for idempotent and transactional producer configuration.
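To see why in-flight requests plus retries can reorder records, consider the toy model below. It is a sketch under simplifying assumptions, not the producer's actual network layer: `deliver` is a hypothetical function, a "window" stands in for the requests in flight at once, and a failed batch is simply retried at the front of the next window without idempotence.

```python
def deliver(batches, fail_first_attempt, max_in_flight):
    """Return the order in which batches land in the partition log."""
    log = []
    pending = list(batches)
    failed_once = set()
    while pending:
        # Send up to max_in_flight batches without waiting for acknowledgements
        window, pending = pending[:max_in_flight], pending[max_in_flight:]
        retries = []
        for batch in window:
            if batch in fail_first_attempt and batch not in failed_once:
                failed_once.add(batch)
                retries.append(batch)  # first attempt failed; retry later
            else:
                log.append(batch)      # appended to the log
        pending = retries + pending
    return log


reordered = deliver(["A", "B", "C"], fail_first_attempt={"A"}, max_in_flight=5)
ordered = deliver(["A", "B", "C"], fail_first_attempt={"A"}, max_in_flight=1)
print(reordered)  # ['B', 'C', 'A']
print(ordered)    # ['A', 'B', 'C']
```

With five requests in flight, B and C are appended while A is still being retried. With one request in flight, nothing is sent until A succeeds.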

Key-Based Partitioning

The producer determines the target partition for each record:

1. Keyed messages (DefaultPartitioner):

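from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    # librdkafka defaults to a CRC32-based partitioner (consistent_random);
    # murmur2_random matches the Java client's key hashing
    "partitioner": "murmur2_random",
})

# All records with key="user-42" go to the same partition
# Partition = toPositive(murmur2(key_bytes)) % num_partitions
producer.produce("user-events", key="user-42", value="login")
producer.produce("user-events", key="user-42", value="purchase")  # Same partition
producer.flush()

The Java client's default partitioner applies a murmur2 hash to the serialized key bytes, then takes the result modulo the partition count. This is deterministic: the same key always maps to the same partition (as long as the partition count does not change). librdkafka-based clients such as confluent-kafka hash keys with CRC32 by default, so set partitioner=murmur2_random when Java and non-Java producers must agree on key placement.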
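The hash-then-modulo step can be sketched in pure Python. The `murmur2` function below is a hand port of the Java client's 32-bit murmur2 routine, written from memory of that implementation. Treat it as illustrative and check it against a real Java producer before depending on exact partition numbers. `partition_for` is a hypothetical helper name.

```python
def murmur2(data: bytes) -> int:
    """32-bit murmur2 (seed 0x9747b28c), returned as an unsigned integer."""
    mask = 0xFFFFFFFF
    m, r = 0x5BD1E995, 24
    length = len(data)
    h = (0x9747B28C ^ length) & mask
    body_len = length - (length % 4)
    # Mix the input four bytes at a time, little-endian
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & mask
        k ^= k >> r
        k = (k * m) & mask
        h = (h * m) & mask
        h ^= k
    # Fold in the remaining 0-3 bytes
    tail = data[body_len:]
    if len(tail) == 3:
        h ^= tail[2] << 16
    if len(tail) >= 2:
        h ^= tail[1] << 8
    if len(tail) >= 1:
        h ^= tail[0]
        h = (h * m) & mask
    # Final avalanche
    h ^= h >> 13
    h = (h * m) & mask
    h ^= h >> 15
    return h


def partition_for(key: bytes, num_partitions: int) -> int:
    # Clearing the sign bit mirrors the "to positive" step before the modulo
    return (murmur2(key) & 0x7FFFFFFF) % num_partitions


p12 = partition_for(b"user-42", 12)
p24 = partition_for(b"user-42", 24)
print(p12, p24)
```

The same key always yields the same partition for a fixed partition count, while a different count can yield a different partition.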

2. Null keys = round-robin / sticky:

# No key -> round-robin distribution (pre-2.4) or sticky partitioner (2.4+)
producer.produce("metrics", value="cpu=80%")
producer.produce("metrics", value="mem=60%")  # May go to a different partition

Since Kafka 2.4, the sticky partitioner is the default for null-key records: the producer "sticks" to one partition until the current batch is full or linger.ms expires, then switches. This improves batching efficiency over pure round-robin.
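The batching benefit can be shown with a small count of batch sizes. This sketch assumes all records arrive within one linger window and models the sticky partitioner as "fill a batch, then move on". Both function names are hypothetical, and the real partitioner's choice of the next partition is not modeled.

```python
def batches_round_robin(num_records: int, num_partitions: int) -> list[int]:
    """Each record goes to the next partition, so batches stay small."""
    sizes = [0] * num_partitions
    for i in range(num_records):
        sizes[i % num_partitions] += 1
    return [size for size in sizes if size]


def batches_sticky(num_records: int, batch_size: int) -> list[int]:
    """Records fill one partition's batch completely before switching."""
    full, rest = divmod(num_records, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


rr = batches_round_robin(8, 4)
sticky = batches_sticky(8, 4)
print(rr)      # [2, 2, 2, 2]
print(sticky)  # [4, 4]
```

The same eight records travel in four small requests under round-robin and in two full ones under the sticky model.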

3. Custom partitioner:

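import zlib

from confluent_kafka import Producer

def region_partition(key: bytes, num_partitions: int) -> int:
    """Route by region prefix in key: b'us-east:user-42' -> hash only the region."""
    region = key.split(b":")[0]
    # zlib.crc32 is stable across processes; built-in hash() is salted per
    # process for bytes, so it would scatter a region across partitions
    return zlib.crc32(region) % num_partitions

# confluent-kafka does not expose custom partitioner callbacks in Python.
# A common workaround: compute the partition in application code and pass
# it explicitly, or encode routing info into the key itself.
producer = Producer({"bootstrap.servers": "localhost:9092"})
num_partitions = len(producer.list_topics("user-events").topics["user-events"].partitions)
key = b"us-east:user-42"
producer.produce("user-events", key=key, value="login",
                 partition=region_partition(key, num_partitions))
producer.flush()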

In Java:

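import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

public class RegionPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        // Null key: pick a random partition
        if (keyBytes == null) return ThreadLocalRandom.current().nextInt(numPartitions);
        // Hash only the region prefix so all keys in a region share a partition
        String region = new String(keyBytes, StandardCharsets.UTF_8).split(":")[0];
        return Utils.toPositive(Utils.murmur2(region.getBytes(StandardCharsets.UTF_8))) % numPartitions;
    }

    @Override public void configure(Map<String, ?> configs) {}
    @Override public void close() {}
}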

Partition Storage: Segments

Each partition is stored as a directory on the broker's log directory. Inside, data is split into segments:

/kafka-logs/user-events-0/
    00000000000000000000.log        # Segment file (records)
    00000000000000000000.index      # Offset -> file position index
    00000000000000000000.timeindex  # Timestamp -> offset index
    00000000000000523417.log        # Next segment (starts at offset 523417)
    00000000000000523417.index
    00000000000000523417.timeindex
    leader-epoch-checkpoint
    partition.metadata

The active segment is the one currently being written to
Segment rotation occurs when segment.bytes (default: 1 GB) is reached or segment.ms (default: 7 days) elapses
Only closed (inactive) segments are eligible for deletion or compaction
Smaller segments = more frequent cleanup but more file handles; larger segments = delayed cleanup
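Because each segment file is named after the offset of its first record, finding the segment that holds a given offset is a binary search over base offsets. The sketch below illustrates that idea only. `segment_filename` and `segment_for_offset` are hypothetical helpers, and the broker's own lookup also consults the .index files to find a position inside the segment.

```python
from bisect import bisect_right


def segment_filename(base_offset: int) -> str:
    """Segment files are named by base offset, zero-padded to 20 digits."""
    return f"{base_offset:020d}.log"


def segment_for_offset(base_offsets: list[int], offset: int) -> str:
    """Return the segment file holding `offset`; base_offsets must be sorted."""
    idx = bisect_right(base_offsets, offset) - 1
    if idx < 0:
        raise ValueError(f"offset {offset} precedes the first segment")
    return segment_filename(base_offsets[idx])


bases = [0, 523417]
print(segment_for_offset(bases, 100))     # 00000000000000000000.log
print(segment_for_offset(bases, 523417))  # 00000000000000523417.log
```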

Partition Reassignment

Partitions can be moved between brokers for load balancing or during broker decommission.

Generate reassignment plan:

# Create a JSON file listing topics to reassign
cat > topics.json << 'EOF'
{"topics": [{"topic": "user-events"}], "version": 1}
EOF

# Generate a reassignment plan (move to brokers 1, 2, 3)
kafka-reassign-partitions.sh --generate \
  --topics-to-move-json-file topics.json \
  --broker-list "1,2,3" \
  --bootstrap-server localhost:9092

Execute reassignment:

# Save the generated plan to reassignment.json, then execute
kafka-reassign-partitions.sh --execute \
  --reassignment-json-file reassignment.json \
  --bootstrap-server localhost:9092

# Monitor progress
kafka-reassign-partitions.sh --verify \
  --reassignment-json-file reassignment.json \
  --bootstrap-server localhost:9092

Throttle reassignment to limit replication bandwidth:

kafka-reassign-partitions.sh --execute \
  --reassignment-json-file reassignment.json \
  --throttle 50000000 \
  --bootstrap-server localhost:9092
# 50 MB/s throttle (value is in bytes/sec) -- prevents reassignment from saturating the network
# Run --verify once the reassignment completes to remove the throttle
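When the generated plan is not what you want, the reassignment file can also be written by hand or by a script. The sketch below builds a plan in the JSON shape that --reassignment-json-file expects (a version field plus a partitions list with topic, partition, and replicas). The `build_reassignment` helper and its rotation scheme are assumptions for illustration, not the algorithm the --generate option uses.

```python
import json


def build_reassignment(topic: str, num_partitions: int,
                       brokers: list[int], replication_factor: int) -> dict:
    """Rotate the broker list so preferred leaders are spread evenly."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor exceeds number of brokers")
    partitions = []
    for p in range(num_partitions):
        replicas = [brokers[(p + i) % len(brokers)] for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})
    return {"version": 1, "partitions": partitions}


plan = build_reassignment("user-events", 4, [1, 2, 3], 3)
print(json.dumps(plan, indent=2))
```

The first broker in each replicas list becomes that partition's preferred leader, so rotating the list spreads leadership across brokers.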

Preferred Leader Election

Each partition has a preferred leader -- the first broker in the replica list. Over time, leaders can shift (broker restarts, failures). To rebalance leadership:

# Trigger preferred leader election for all partitions
kafka-leader-election.sh --election-type PREFERRED --all-topic-partitions \
  --bootstrap-server localhost:9092

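# For a single partition (--topic must be paired with --partition;
# use --path-to-json-file to target several partitions)
kafka-leader-election.sh --election-type PREFERRED \
  --topic user-events --partition 0 \
  --bootstrap-server localhost:9092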

# Unclean leader election (risk of data loss -- only for non-critical topics)
kafka-leader-election.sh --election-type UNCLEAN \
  --topic user-events --partition 0 \
  --bootstrap-server localhost:9092

auto.leader.rebalance.enable=true (default) triggers automatic preferred leader election when leader imbalance exceeds leader.imbalance.percentage.per.broker (default: 10%).
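The imbalance check can be approximated as follows: for each broker, take the partitions where it is the preferred leader and compute the fraction it is not currently leading. This is a sketch of that idea under my reading of the setting, not the controller's code. `leader_imbalance` and the sample cluster state are hypothetical.

```python
from collections import defaultdict


def leader_imbalance(partitions):
    """partitions: iterable of (replica_list, current_leader) pairs.

    Returns, per preferred broker, the share of its preferred partitions
    that are currently led by some other broker.
    """
    preferred_total = defaultdict(int)
    not_leading = defaultdict(int)
    for replicas, leader in partitions:
        preferred = replicas[0]  # first replica is the preferred leader
        preferred_total[preferred] += 1
        if leader != preferred:
            not_leading[preferred] += 1
    return {b: not_leading[b] / preferred_total[b] for b in preferred_total}


state = [
    ([1, 2, 3], 1),
    ([1, 3, 2], 3),  # broker 1 is preferred, but broker 3 leads
    ([2, 3, 1], 2),
    ([3, 1, 2], 3),
]
ratios = leader_imbalance(state)
over_threshold = [b for b, ratio in ratios.items() if ratio > 0.10]
print(ratios)          # {1: 0.5, 2: 0.0, 3: 0.0}
print(over_threshold)  # [1]
```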

Topic Configuration

cleanup.policy

Controls how old data is removed from a topic.

| Policy | Behavior | Use Case |
|---|---|---|
| `delete` (default) | Remove segments older than `retention.ms` or larger than `retention.bytes` | Event streams, logs, metrics |
| `compact` | Keep only the latest value per key; tombstone (null value) removes key | State snapshots, changelogs, caches |
| `compact,delete` | Compact first, then delete segments older than retention | Compacted topics with a retention ceiling |

Compaction internals:

The log cleaner thread scans "dirty" (uncompacted) segments
For each key, only the record with the highest offset survives
A tombstone (key with null value) marks a key for deletion; removed after delete.retention.ms (default: 24h)
min.compaction.lag.ms -- minimum time before a record is eligible for compaction (prevents compacting records that consumers haven't processed yet)
max.compaction.lag.ms -- maximum time before compaction is guaranteed to run
min.cleanable.dirty.ratio (default: 0.5) -- compaction starts when 50%+ of the log is dirty
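The "latest record per key survives" rule can be sketched in a few lines. This is a simplified model, not the log cleaner itself: `compact` is a hypothetical function, it ignores segments and lag settings, and it shows the state after tombstones have aged past delete.retention.ms.

```python
def compact(log):
    """log: list of (offset, key, value) tuples in offset order.

    Keeps the highest-offset record per key and drops keys whose latest
    record is a tombstone (value is None). Offsets are never renumbered.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later offsets overwrite earlier ones
    survivors = [rec for rec in latest.values() if rec[2] is not None]
    return sorted(survivors)


log = [
    (0, "u1", "a"),
    (1, "u2", "b"),
    (2, "u1", "c"),   # supersedes offset 0
    (3, "u2", None),  # tombstone for u2
    (4, "u3", "d"),
]
compacted = compact(log)
print(compacted)  # [(2, 'u1', 'c'), (4, 'u3', 'd')]
```

The surviving records keep their original offsets, so a compacted partition has gaps in its offset sequence.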

# Create a compacted topic
kafka-topics.sh --create --topic user-profiles \
  --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.compaction.lag.ms=3600000 \
  --config delete.retention.ms=86400000 \
  --bootstrap-server localhost:9092

Retention Configuration

| Parameter | Default | Description |
|---|---|---|
| `retention.ms` | 604800000 (7 days) | Time-based retention; `-1` = infinite |
| `retention.bytes` | -1 (unlimited) | Size-based retention per partition |
| `segment.bytes` | 1073741824 (1 GB) | Max size of a single segment file |
| `segment.ms` | 604800000 (7 days) | Max time before active segment is rolled |
| `min.compaction.lag.ms` | 0 | Minimum delay before record is compactable |
| `max.compaction.lag.ms` | 9223372036854775807 | Maximum delay before compaction runs |
| `delete.retention.ms` | 86400000 (24h) | How long tombstones survive after compaction |
| `message.timestamp.type` | `CreateTime` | `CreateTime` (producer sets) or `LogAppendTime` (broker sets) |

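Retention is evaluated per segment, not per record. With time-based retention, a segment is deleted only once its newest record is older than retention.ms, so individual records can outlive the threshold until the rest of their segment expires.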
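A minimal sketch of per-segment, time-based retention follows. The `Segment` class and `deletable` function are hypothetical. The model follows the description in this article (closed segments only, compared by their newest timestamp) and leaves out retention.bytes.

```python
from dataclasses import dataclass

DAY_MS = 86_400_000


@dataclass
class Segment:
    base_offset: int
    max_timestamp_ms: int  # timestamp of the newest record in the segment
    active: bool = False


def deletable(segments, now_ms, retention_ms):
    """Return base offsets of closed segments whose newest record has expired."""
    if retention_ms < 0:  # -1 means retain forever
        return []
    cutoff = now_ms - retention_ms
    return [s.base_offset for s in segments
            if not s.active and s.max_timestamp_ms < cutoff]


segments = [
    Segment(0, 2 * DAY_MS),                 # newest record is 8 days old
    Segment(1000, 4 * DAY_MS),              # newest record is 6 days old
    Segment(2000, 9 * DAY_MS, active=True),
]
expired = deletable(segments, now_ms=10 * DAY_MS, retention_ms=7 * DAY_MS)
print(expired)  # [0]
```

The segment starting at offset 1000 may contain records older than seven days, but it stays on disk because its newest record is still inside the retention window.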

Practical Patterns

kafka-topics.sh Commands

Create a topic:

kafka-topics.sh --create --topic orders \
  --partitions 12 --replication-factor 3 \
  --config retention.ms=259200000 \
  --config cleanup.policy=delete \
  --bootstrap-server kafka1:9092

Describe a topic (partitions, replicas, ISR, configs):

kafka-topics.sh --describe --topic orders \
  --bootstrap-server kafka1:9092

# Output:
# Topic: orders  PartitionCount: 12  ReplicationFactor: 3  Configs: retention.ms=259200000
#   Topic: orders  Partition: 0  Leader: 1  Replicas: 1,2,3  Isr: 1,2,3
#   Topic: orders  Partition: 1  Leader: 2  Replicas: 2,3,1  Isr: 2,3,1
#   ...

List all topics:

kafka-topics.sh --list --bootstrap-server kafka1:9092

# Exclude internal topics
kafka-topics.sh --list --exclude-internal --bootstrap-server kafka1:9092

Alter partition count (increase only):

kafka-topics.sh --alter --topic orders --partitions 24 \
  --bootstrap-server kafka1:9092
# WARNING: breaks key-based partition assignment for existing keys

Alter topic configs:

# Using kafka-configs.sh (preferred for config changes)
kafka-configs.sh --alter --entity-type topics --entity-name orders \
  --add-config retention.ms=86400000,segment.bytes=536870912 \
  --bootstrap-server kafka1:9092

# Remove a config override (revert to broker default)
kafka-configs.sh --alter --entity-type topics --entity-name orders \
  --delete-config retention.ms \
  --bootstrap-server kafka1:9092

# Describe current configs
kafka-configs.sh --describe --entity-type topics --entity-name orders \
  --bootstrap-server kafka1:9092

Delete a topic:

kafka-topics.sh --delete --topic orders \
  --bootstrap-server kafka1:9092
# Requires delete.topic.enable=true on broker (default: true since Kafka 1.0)
# Deletion is asynchronous -- data is removed in the background

Programmatic Topic Management (Python)

from confluent_kafka.admin import AdminClient, NewTopic, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Create topic
topic = NewTopic(
    "user-events",
    num_partitions=12,
    replication_factor=3,
    config={
        "cleanup.policy": "compact,delete",
        "retention.ms": "604800000",
        "min.compaction.lag.ms": "3600000",
        "segment.bytes": "536870912",
    },
)
futures = admin.create_topics([topic])
for topic_name, future in futures.items():
    try:
        future.result()  # Block until complete
        print(f"Created topic: {topic_name}")
    except Exception as e:
        print(f"Failed to create {topic_name}: {e}")

# Describe topic config
resource = ConfigResource("TOPIC", "user-events")
futures = admin.describe_configs([resource])
for res, future in futures.items():
    configs = future.result()
    for key, config in configs.items():
        print(f"  {key} = {config.value}")

# List topics
metadata = admin.list_topics(timeout=10)
for topic_name in metadata.topics:
    print(f"Topic: {topic_name}, Partitions: {len(metadata.topics[topic_name].partitions)}")

Partition Count Selection Heuristic

Target throughput:    100 MB/s
Per-partition write:  ~10 MB/s (single partition, single producer)
Consumer instances:   8 (in one consumer group)

Minimum partitions = max(100/10, 8) = 10
Recommended:         12 (round up, leave room for growth)

Start with fewer partitions and increase later if needed, keeping in mind that adding partitions remaps keyed records (see Gotchas). You cannot decrease partition count without recreating the topic.
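The heuristic is simple enough to express as a function. This is a sketch of the arithmetic above. `min_partitions` and `with_headroom` are hypothetical names, and the 20% headroom is an assumption chosen to reproduce the "round up to 12" recommendation.

```python
import math


def min_partitions(target_mb_s: float, per_partition_mb_s: float, consumers: int) -> int:
    """Enough partitions for both write throughput and consumer parallelism."""
    return max(math.ceil(target_mb_s / per_partition_mb_s), consumers)


def with_headroom(partitions: int, headroom_pct: int = 20) -> int:
    """Round up by a growth margin using integer arithmetic."""
    return (partitions * (100 + headroom_pct) + 99) // 100


minimum = min_partitions(100, 10, 8)
recommended = with_headroom(minimum)
print(minimum)      # 10
print(recommended)  # 12
```

Measure the per-partition write rate on your own hardware rather than assuming 10 MB/s, since it depends on record size, compression, and replication settings.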

Gotchas

Increasing partitions breaks key affinity. murmur2(key) % 12 and murmur2(key) % 24 differ for roughly half of all keys, so after increasing partitions, new records for an existing key may land in a different partition than its earlier records. If ordering matters, either plan partition count from the start or use a custom partitioner that handles growth.
Partition count cannot be decreased. The only way to reduce is to create a new topic with fewer partitions and migrate data using MirrorMaker, Kafka Connect, or a consumer-producer bridge.
Compaction requires non-null keys. If cleanup.policy includes compact, every record must have a key. The broker rejects null-key records sent to a compacted topic, so the produce request fails.
Active segment is never deleted or compacted. Retention and compaction only apply to closed segments. If segment.bytes=1GB and a partition receives 500 MB/day, its active segment won't roll for ~2 days, delaying cleanup.
Topic deletion is asynchronous and may leave state. After --delete, partitions are marked for deletion but data removal happens in the background. Recreating a topic with the same name can fail until deletion completes, and disk space is reclaimed later.
retention.bytes is per-partition, not per-topic. A topic with 12 partitions and retention.bytes=1GB can retain up to 12 GB total.
Unclean leader election risks data loss. If unclean.leader.election.enable=true (default: false since Kafka 0.11), an out-of-sync replica can become leader, losing unreplicated messages.
Cross-partition joins require co-partitioning. If you need to join two topics by key, both must have the same partition count and use the same partitioner. See [[consumer-groups]] for RangeAssignor requirements.
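The key-affinity gotcha is easy to check numerically. The sketch below uses zlib.crc32 as a stand-in stable hash (an assumption made to keep the example short; the Java client hashes with murmur2) and counts how many keys change partition when a topic grows from 12 to 24 partitions.

```python
import zlib


def partition(key: bytes, num_partitions: int) -> int:
    # zlib.crc32 stands in for the client's real key hash
    return zlib.crc32(key) % num_partitions


keys = [f"user-{i}".encode() for i in range(1000)]
moved = [k for k in keys if partition(k, 12) != partition(k, 24)]
stayed = [k for k in keys if partition(k, 12) == partition(k, 24)]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

A key keeps its partition only when hash % 24 is below 12. Otherwise it moves to its old partition number plus 12. Either way, records written before the change stay where they were, so per-key ordering across the boundary is lost for every moved key.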

Quick Reference

Key Topic-Level Configuration

| Parameter | Default | Description |
|---|---|---|
| `cleanup.policy` | `delete` | `delete`, `compact`, or `compact,delete` |
| `retention.ms` | 604800000 (7d) | Time-based retention |
| `retention.bytes` | -1 | Per-partition size-based retention |
| `segment.bytes` | 1073741824 (1GB) | Segment file size |
| `segment.ms` | 604800000 (7d) | Max age of active segment |
| `min.compaction.lag.ms` | 0 | Min delay before compaction eligibility |
| `max.compaction.lag.ms` | MAX_LONG | Max delay before forced compaction |
| `delete.retention.ms` | 86400000 (24h) | Tombstone TTL after compaction |
| `min.cleanable.dirty.ratio` | 0.5 | Dirty log ratio to trigger compaction |
| `max.message.bytes` | 1048588 (~1MB) | Max record size for this topic |
| `min.insync.replicas` | 1 | ISR count required for `acks=all` |
| `unclean.leader.election.enable` | false | Allow out-of-sync replica as leader |
| `message.timestamp.type` | `CreateTime` | `CreateTime` or `LogAppendTime` |

Essential CLI Commands

# Create
kafka-topics.sh --create --topic T --partitions N --replication-factor R \
  --bootstrap-server HOST:9092

# Describe
kafka-topics.sh --describe --topic T --bootstrap-server HOST:9092

# List
kafka-topics.sh --list --bootstrap-server HOST:9092

# Alter partitions (increase only)
kafka-topics.sh --alter --topic T --partitions N --bootstrap-server HOST:9092

# Alter configs
kafka-configs.sh --alter --entity-type topics --entity-name T \
  --add-config KEY=VALUE --bootstrap-server HOST:9092

# Delete
kafka-topics.sh --delete --topic T --bootstrap-server HOST:9092

# Preferred leader election
kafka-leader-election.sh --election-type PREFERRED --all-topic-partitions \
  --bootstrap-server HOST:9092

# Reassign partitions
kafka-reassign-partitions.sh --execute \
  --reassignment-json-file plan.json --bootstrap-server HOST:9092

Official Documentation

Topics and Logs - core topic/partition model
Topic-Level Configs - full configuration reference
Log Compaction - compaction semantics and guarantees
Operations: Adding/Removing Topics - CLI management
Partition Reassignment - rebalancing partitions across brokers