
Apache Kafka : Introduction and Overview

Mar 31, 2024

What is the publish/subscribe (pub/sub) pattern?

Publisher-Subscriber (pub/sub) messaging is a pattern in which the "sender (publisher)" sends events to the "Pub/Sub (broker)" service without regard to how or when these events will be processed.

"Pub/Sub (broker)" then delivers events to all the "receivers (subscribers)" that subscribe to certain classes of these messages.

The pub/sub messaging pattern decouples the publishers of the information from the subscribers to that information.
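
As a minimal illustration of this decoupling (a hypothetical in-memory sketch, not Kafka code), a broker can sit between publishers and subscribers so that neither side knows about the other:

```java
import java.util.*;
import java.util.function.Consumer;

// Minimal in-memory pub/sub broker (illustrative only, not Kafka's API).
class Broker {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // A subscriber registers interest in a class (topic) of messages.
    void subscribe(String topic, Consumer<String> subscriber) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscriber);
    }

    // A publisher hands an event to the broker without knowing who will receive it.
    void publish(String topic, String event) {
        subscribers.getOrDefault(topic, List.of()).forEach(s -> s.accept(event));
    }
}

public class PubSubDemo {
    public static void main(String[] args) {
        Broker broker = new Broker();
        broker.subscribe("orders", e -> System.out.println("billing saw: " + e));
        broker.subscribe("orders", e -> System.out.println("shipping saw: " + e));
        broker.publish("orders", "order-42 created"); // both subscribers receive it
    }
}
```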

What is Kafka?

Apache Kafka was developed as a publish/subscribe messaging system and is now often described as a "distributed commit log" or, more recently, as a "distributed streaming platform."

Kafka has since evolved into a distributed, partitioned, replicated commit log service that provides the functionality of a messaging system.

Similar to a distributed commit log, data within Kafka is stored durably, in order, and can be read deterministically.

In addition, the data can be distributed within the system to provide fault-tolerance as well as significant opportunities for scalability.

Kafka runs as a cluster of one or more servers, each of which is called a "broker".

So, at a high level, producers send messages over the network to the Kafka cluster, which in turn serves them up to consumers.

Communication between the clients (producers/consumers) and the Kafka cluster is done with a simple, high-performance, language-agnostic "TCP protocol".

Messages

The unit of data within Kafka is called a "message". It is simply an "array of bytes" as far as Kafka is concerned.

A message can have an optional piece of metadata, which is referred to as a "key" (also an array of bytes).

Messages in Kafka are categorised into "topics".
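
To make this concrete, here is a sketch using Kafka's Java client (the topic name, key, and value below are made up): a ProducerRecord carries an optional key and a value, and serializers turn both into the byte arrays that Kafka actually stores.

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MessageSketch {
    public static void main(String[] args) {
        // A record destined for a hypothetical "page-views" topic:
        // the key and value are plain strings here, ...
        ProducerRecord<String, String> record =
                new ProducerRecord<>("page-views", "user-123", "viewed /home");

        // ... but Kafka itself only ever sees byte arrays, produced by serializers.
        StringSerializer serializer = new StringSerializer();
        byte[] keyBytes = serializer.serialize("page-views", record.key());
        byte[] valueBytes = serializer.serialize("page-views", record.value());
        System.out.println("key: " + keyBytes.length + " bytes, value: " + valueBytes.length + " bytes");
        serializer.close();
    }
}
```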

Topics & Partitions

Kafka topics are the categories used to organise messages.

Each topic has a name that is "unique" across the entire Kafka cluster.

Topics are additionally broken down into a number of partitions.

Each partition is an ordered, immutable sequence of messages that is continually appended to - a commit log.

Messages are written to the commit log in an append-only fashion and are read in order from beginning to end.

The messages in the partitions are each assigned a sequential id number called the offset that uniquely identifies each message within the partition.

A topic typically has multiple partitions.

Only within a partition can message ordering be guaranteed, not across the entire topic.

However, messages sent by a producer to a particular topic partition will be appended in the order they are sent.

Keep in mind that the number of partitions for a topic can only be increased, never decreased.
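
For example, the partition count is fixed when a topic is created; a sketch with the Java AdminClient (the broker address and topic name are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical "orders" topic with 3 partitions, each replicated to 2 brokers.
            NewTopic orders = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(orders)).all().get();

            // Partitions can later be increased (never decreased), e.g. from 3 to 6:
            // admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(6))).all().get();
        }
    }
}
```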

Fault-tolerance

Partitions can be replicated, such that different servers will store a copy of the same partition in case one server fails.

For each partition, the server hosting one replica acts as the "leader", while the servers hosting the remaining replicas of the same partition (zero or more) act as "followers".

This also means that each server acts as a leader for some of the partitions it hosts and as a follower for others.

The leader handles all read and write requests for the partition while the followers passively replicate the leader.

If the leader fails, one of the followers will automatically become the new leader.

For a topic with a replication factor of N, Kafka will tolerate up to N-1 server failures without losing any messages committed to the log.
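
As a sketch of how this looks in practice, the leader and follower replicas of each partition can be inspected with the Java AdminClient (the broker address and topic name are assumptions):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("orders")) // hypothetical topic
                    .values().get("orders").get();

            // For every partition, print which broker leads it and which brokers follow.
            description.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s, replicas=%s, in-sync=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```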

Scalability

Partitions are also the way that Kafka provides scalability.

A single topic can be scaled horizontally by distributing the hosting of its partitions on different servers within a Kafka cluster.

All producers must connect to the leader in order to publish messages, but consumers may fetch messages from either the leader or one of the followers.
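
Fetching from followers is opt-in. As a sketch (the broker address and rack id are assumptions), a consumer can advertise which "rack" (for example, an availability zone) it runs in so that, on a cluster configured with a rack-aware replica selector, fetches may be served by a nearby follower:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class FollowerFetchSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "eu-west-1a");           // assumed rack id

        // The brokers must also be configured with a replica selector, e.g.:
        // replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
        System.out.println(props);
    }
}
```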

Retention

The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.

Brokers are configured with a default retention setting for topics, either retaining messages for some period of time (e.g. 30 days) or until the partition reaches a certain size in bytes (e.g. 2 GB).

Once these limits are reached, messages are expired and deleted.

Individual topics can also be configured with their own retention settings so that messages are stored for only as long as they are useful.

Topics can also be configured as log compacted, which means that Kafka will retain only the last message produced with a specific key.
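
As an illustration, retention and compaction are per-topic configuration; a sketch that sets them when creating two hypothetical topics with the Java AdminClient (broker address, topic names, and limits are assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionConfigSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Time/size based retention: keep messages for 30 days or until ~2 GB per partition.
            NewTopic clickstream = new NewTopic("clickstream", 6, (short) 3).configs(Map.of(
                    "retention.ms", String.valueOf(30L * 24 * 60 * 60 * 1000),
                    "retention.bytes", String.valueOf(2L * 1024 * 1024 * 1024)));

            // Log compaction: keep only the last message produced with each key.
            NewTopic userProfiles = new NewTopic("user-profiles", 6, (short) 3)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(List.of(clickstream, userProfiles)).all().get();
        }
    }
}
```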

Kafka Cluster & Brokers

A single Kafka instance or node within a cluster is called a broker.

The broker receives messages from producers, assigns offsets to them, and writes the messages to storage on disk.

It also services consumers, responding to fetch requests for partitions with the messages that have been published.

Within a cluster of brokers, one broker also functions as the cluster controller (elected automatically from among the live members of the cluster).

The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures.

Producers and consumers

Two types of clients interact with the Kafka cluster: producers and consumers.

Producers

Producers create new messages and publish them to a specific topic.

By default, the producer will balance messages over all partitions of a topic evenly.

Messages are assigned to a particular partition using the message key and a partitioner.

Custom partitioners can also be used to map messages to partitions according to other business rules.
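
A minimal producer sketch with Kafka's Java client (the broker address, topic, key, and value are assumptions); with a key set, the default partitioner hashes the key so that all messages with the same key land in the same partition:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key "customer-42" sends every event for that customer to the same partition.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order placed");

            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("written to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // closing the producer flushes any buffered records
    }
}
```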

Consumers

The consumer subscribes to one or more topics and reads the messages in the order in which they were produced to each partition.

Consumers work as part of a "consumer group," which is one or more consumers that work together to consume a topic.

The group ensures that each partition is consumed by only one of its members.

However, one consumer can consume messages from different partitions.

Consumers can horizontally scale to consume topics with a large number of messages.

If a single consumer fails, the remaining members of the group rebalance, taking over the partitions that the missing member was consuming.
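
A matching consumer sketch (the broker address, topic, and group name are assumptions); every consumer started with the same group.id joins the same group and is assigned a share of the topic's partitions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");          // assumed group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders")); // assumed topic

            while (true) {
                // Poll the assigned partitions and process records in offset order.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```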

Queuing vs Pub-Sub

In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe, the message is broadcast to all consumers.

If all the consumer instances have the same consumer group, this works just like a traditional queue, balancing load over the consumers.

If all the consumer instances have different consumer groups, this works like publish/subscribe and every message is broadcast to all consumers.
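
In other words, whether Kafka behaves like a queue or like pub/sub for a given set of consumers is decided purely by their group.id; a sketch of the two configurations (the group names are made up):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class GroupIdSketch {
    public static void main(String[] args) {
        // Queue-like: start several consumers that all use the SAME group id;
        // each partition (and therefore each message) is handled by only one of them.
        Properties queueLike = new Properties();
        queueLike.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");

        // Pub/sub-like: give each consuming application its OWN group id;
        // every group receives its own copy of every message.
        Properties pubSubLike = new Properties();
        pubSubLike.put(ConsumerConfig.GROUP_ID_CONFIG, "audit-service"); // unique per application

        System.out.println(queueLike + " vs " + pubSubLike);
    }
}
```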

History

Kafka was created to address the data pipeline problem at LinkedIn.

Kafka was released as an open source project on GitHub in late 2010.


NK Chauhan

NK Chauhan is a Principal Software Engineer with one of the biggest e-commerce companies in the world.

Chauhan has around 12 years of experience with a focus on JVM-based technologies and Big Data.

His hobbies include playing cricket, video games and hanging out with friends.
