Apache Kafka: Best Practices - Topics, Partitions, Consumers and Producers
- Jul 20, 2024
How to choose Number of Partitions in Kafka
More partitions means higher throughput
A topic partition is the unit of parallelism in Kafka on both the producer and the consumer side.
Writes to different partitions can be done fully in parallel.
On the consumer side, each partition is consumed by exactly one consumer within a consumer group, so the partition count caps the degree of consumer parallelism.
Therefore, in general, the more partitions there are in a Kafka cluster, the higher the throughput one can achieve.
Picking partitions based on throughput
Let's say our target throughput is t, and the throughput achievable on a single partition is p for production and c for consumption.
Then we need to have at least max(t/p, t/c) partitions.
The per-partition throughput for producer depends on configurations such as the batching size, compression codec, type of acknowledgement, replication factor, etc.
However, in general, one can produce at 10s of MB/sec on just a single partition as shown in this benchmark.
The consumer throughput is often application dependent since it corresponds to how fast the consumer logic can process each message.
So if we want to be able to write and read 1 GBps from a topic, and we know each consumer can only process 50 MBps, then we know we need at least 20 partitions.
This way, we can have 20 consumers reading from the topic and achieve 1 GBps.
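The rule above can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the 100 MBps per-partition producer rate is an assumed figure, since the text gives only the consumer rate.

```python
import math


def min_partitions(target_mb_s: float, producer_mb_s: float, consumer_mb_s: float) -> int:
    """Smallest whole number of partitions satisfying max(t/p, t/c)."""
    if min(target_mb_s, producer_mb_s, consumer_mb_s) <= 0:
        raise ValueError("throughput values must be positive")
    needed = max(target_mb_s / producer_mb_s, target_mb_s / consumer_mb_s)
    # Round up: a fractional partition is not possible.
    return math.ceil(needed)


# Target of 1 GBps (1000 MBps) and 50 MBps per consumer, as in the text.
# The 100 MBps producer rate per partition is an assumption for illustration.
print(min_partitions(1000, 100, 50))  # 20
```

Here the consumer side is the bottleneck, so it alone determines the result. Measured values for p and c from your own workload should replace the assumed ones.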
Over-partition is better
When publishing a keyed message, Kafka deterministically maps the message to a partition based on the hash of the key.
This provides a guarantee that messages with the same key are always routed to the same partition.
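If the number of partitions changes, the key-to-partition mapping changes with it, so messages with the same key may land on a different partition than before and per-key ordering can be lost.
To avoid this situation, a common practice is to over-partition a bit: size the topic for expected future throughput rather than current throughput, so the partition count does not have to change later.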
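To see why changing the partition count breaks the key guarantee, here is a minimal sketch of hash-modulo partitioning. It uses CRC32 as a stand-in hash purely for illustration; Kafka's default partitioner uses a murmur2 hash, so the actual partition numbers will differ. The key names and partition counts are made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing it and taking the remainder.

    CRC32 is a stand-in here, not the hash Kafka itself uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical keys; compare their partitions before and after a resize.
keys = [f"user-{i}" for i in range(1000)]
moved = sum(1 for k in keys if partition_for(k, 20) != partition_for(k, 30))
print(f"{moved} of {len(keys)} keys map to a different partition after resizing")
```

With a fixed partition count the mapping is stable, but growing the topic from 20 to 30 partitions reassigns a large share of the keys.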
Negative impact of a higher number of partitions
Higher open file handle limit
Each partition maps to a directory in the broker file system.
Within that log directory, there will be two files (for index and actual data) per log segment.
In Kafka, each broker opens a file handle for both the index and the data file of every log segment.
So the more partitions a broker hosts, the higher the open file handle limit must be set in the underlying operating system.
This is mostly just a configuration issue.
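A rough estimate of the handles a broker needs follows from the description above. This is a back-of-the-envelope sketch: the function name and the example figures (2000 partitions, 10 segments each) are hypothetical, and it counts only the two files per segment mentioned in the text, not sockets or other files.

```python
def estimated_open_files(partitions_on_broker: int,
                         segments_per_partition: int,
                         files_per_segment: int = 2) -> int:
    """Estimate open file handles for log segments on one broker.

    files_per_segment defaults to 2 (index + data), as described in the text.
    """
    return partitions_on_broker * segments_per_partition * files_per_segment


# Hypothetical broker: 2000 partitions with about 10 segments each.
print(estimated_open_files(2000, 10))  # 40000
```

The operating system limit should sit comfortably above such an estimate, since the broker also holds handles for network connections.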
Avoid overestimating
Avoid overestimating, as each partition uses memory and other resources on the broker and will increase the time for metadata updates and leadership transfers.
Always remember, starting small and expanding as needed is easier than starting too large.
How to choose Log segment size in Kafka
The log retention settings operate on log segments, not individual messages.
If you have specified a value for both log.retention.bytes and log.retention.ms, messages may be removed when either criterion is met.
A smaller log segment size (log.segment.bytes) means that files must be closed and allocated more often, which reduces the overall efficiency of disk writes.
Adjusting the size of the log segments can be important if topics have a low produce rate.
For example, if a topic receives only 100 MB per day of messages, and log.segment.bytes is left at its default of 1 GB, it will take about 10 days to fill one segment.
Because messages cannot expire until the log segment is closed, if log.retention.ms is set to 604800000 (1 week), up to 17 days of messages will be retained before the closed segment expires.
This is because once the log segment is closed with its 10 days of messages, it must be retained for a further 7 days under the time policy: the segment cannot be removed until the last message in it can be expired.
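The arithmetic behind the 17-day figure can be written out as follows. This is a simplified sketch: the function name is hypothetical, the segment size is rounded to 1000 MB to match the text's figures, and it ignores time-based segment rolling (log.roll.ms / log.roll.hours), which would close a slow-filling segment earlier.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def max_retention_days(segment_mb: float,
                       produce_mb_per_day: float,
                       retention_ms: int) -> float:
    """Worst-case age of the oldest message under size-based rolling only."""
    # Days until the active segment fills up and is closed.
    days_to_fill = segment_mb / produce_mb_per_day
    # The closed segment is kept until its newest message passes retention.
    return days_to_fill + retention_ms / MS_PER_DAY


# 1 GB segment (rounded to 1000 MB), 100 MB/day, one week of retention.
print(max_retention_days(1000, 100, 604_800_000))  # 17.0
```

Lowering log.segment.bytes for low-volume topics shrinks the first term and brings effective retention closer to the configured value.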
Note that all retention is performed for individual partitions, not the topic.