Kafka: The Distributed System

Prasanth
5 min read · May 20, 2020

Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and had its initial release in January 2011; later on it became part of the Apache project. It is currently used by numerous web-scale Internet enterprises such as LinkedIn, Twitter, Airbnb, Yahoo, Netflix, Uber, Spotify, Pinterest, Tumblr, Mozilla, Amazon Web Services, etc. Kafka is written in Java and Scala, so it can run on any OS that supports the JVM. The purpose of the Kafka project is to provide a unified, high-throughput, low-latency platform for real-time data processing. In this blog we are going to discuss how Kafka provides the following properties:

  • Fault-Tolerant
  • Highly Available
  • Recoverable
  • Consistent
  • Scalable
  • Predictable Performance
  • Secure

Fault-Tolerant

First of all, fault tolerance means the ability to recover from component failures without performing incorrect actions. Kafka implements a leader and follower model: for every partition, one broker is elected as the leader, and the leader takes care of all client interactions. When a producer wants to send some data, it connects to the leader and starts sending. It is the leader's responsibility to receive the message, store it on local disk, and send an acknowledgment back to the producer. Similarly, when a consumer wants to read data, it sends a request to the leader, and it is the leader's responsibility to send the requested data back to the consumer.
For every partition, we have a leader, and the leader takes care of all requests and responses. So, if we create a topic with the replication factor (the term used for making multiple copies) set to three, the leader of each partition already maintains the first copy, and we need two more. Kafka will identify two more brokers as followers to make those two copies. The followers copy the data from the leader; they don't talk to producers or consumers, they just replicate the leader. With these extra copies in place, we can easily recover from component failures.
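As a quick illustration, here is a minimal sketch of the producer side using the Kafka Java client. The broker address localhost:9092, the topic name "orders", and the key/value strings are placeholders, and the sketch assumes the topic was created with a replication factor of three. With acks=all, the leader acknowledges the write only after the in-sync followers have copied it as well.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class LeaderAckProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder broker address; the client discovers each partition's leader from here.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // acks=all: the leader only acknowledges once the in-sync followers
            // have also copied the record, so a single broker failure cannot lose it.
            props.put(ProducerConfig.ACKS_CONFIG, "all");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                RecordMetadata meta =
                        producer.send(new ProducerRecord<>("orders", "order-1", "created")).get();
                System.out.printf("Leader acknowledged: partition=%d offset=%d%n",
                        meta.partition(), meta.offset());
            }
        }
    }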

Highly Available

As a distributed cluster, Kafka brokers ensure high availability to process new events. Each topic has a replication factor so that data is not lost in case of a broker failure. You need at least 3 brokers and a replication factor of 3 for each topic to ensure that no data is lost. Therefore we can restore operations and keep providing service even when some components have failed.
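As a sketch of how such a topic could be created programmatically, the snippet below uses the Java AdminClient; the topic name "payments", the partition count, and the broker address are placeholders. The min.insync.replicas=2 setting is an additional, commonly used safeguard (not discussed above) that, together with acks=all on the producer, means a write is only acknowledged once at least two replicas hold it.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    public class HighAvailabilityTopicExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            // Replication factor 3: the cluster can lose one broker and still
            // keep two in-sync copies of every partition.
            Map<String, String> configs = new HashMap<>();
            configs.put(TopicConfig.MIN_IN_SYNC_REPLICAS_CONFIG, "2");
            NewTopic topic = new NewTopic("payments", 3, (short) 3).configs(configs);

            try (AdminClient admin = AdminClient.create(props)) {
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }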

Recoverable

Now assume that Broker 2 has failed. Brokers 1 and 3 can still serve the data for topic 1. So a replication factor of 3 is always a good idea, since it allows one broker to be taken down for maintenance and another one to go down unexpectedly at the same time. If we are planning to use Kafka as storage, we also need to be aware of the configurable retention period for every topic. If we don't take care of this setting, we might lose data; as the docs put it, if the retention policy is set to two days, then for the two days after a record is published it is available for consumption, after which it is discarded to free up space.
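The retention period can be adjusted per topic. Here is a hedged sketch using the Java AdminClient's incrementalAlterConfigs call to set retention.ms to two days on a hypothetical "orders" topic; the topic name and broker address are placeholders.

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;
    import org.apache.kafka.common.config.TopicConfig;

    import java.util.Collection;
    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;

    public class RetentionConfigExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            // Keep records on the "orders" topic for two days (in milliseconds),
            // matching the retention example quoted from the docs above.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG,
                            String.valueOf(2L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);

            Map<ConfigResource, Collection<AlterConfigOp>> update =
                    Collections.singletonMap(topic, Collections.singletonList(setRetention));

            try (AdminClient admin = AdminClient.create(props)) {
                admin.incrementalAlterConfigs(update).all().get();
            }
        }
    }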

Consistent

Consider an Order service and an Account service that both hold a customer's contact information: by way of Kafka, the Order's copy is eventually consistent with the Account's copy. Eventually consistent means the system favours high availability while informally guaranteeing that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Kafka makes the following guarantees about data consistency (illustrated with a small consumer sketch after the list):

(1) Messages sent to a topic partition will be appended to the commit log in the order they are sent.

(2) A single consumer instance will see messages in the order they appear in the log.

(3) A message is considered 'committed' only when all in-sync replicas for that partition have applied it to their log.
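To see guarantee (2) in action, here is a minimal consumer sketch in the Java client; the topic "customer-contact-updates", the group id, and the broker address are made-up names for the Order/Account scenario above. Within each partition, offsets are strictly increasing, so the consumer observes updates in the order they were appended.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class OrderedConsumerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "contact-info-sync");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("customer-contact-updates"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        // Within a partition, records arrive in the order they were written.
                        System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                                record.partition(), record.offset(), record.key(), record.value());
                    }
                }
            }
        }
    }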

Scalable

One of the most cited advantages of Kafka is its scalability: it can operate correctly even as some aspect of the system is scaled to a larger size. Adding consumers to a consumer group to enable load balancing requires the group's topic partitions to be rebalanced so that the load is redistributed across the consumers. This part of the Kafka architecture keeps the broker very simple and makes consumer scaling effective.
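The sketch below shows one way to observe this rebalancing from the Java consumer API: every instance started with the same (placeholder) group id shares the topic's partitions, and the ConsumerRebalanceListener callbacks fire whenever partitions are revoked or assigned, for example because another consumer joined the group. The topic and group names are assumptions for illustration.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collection;
    import java.util.Collections;
    import java.util.Properties;

    public class RebalanceAwareConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            // Every instance started with the same group.id shares the topic's partitions.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "orders-processors");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders"), new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                        // Partitions taken away, e.g. because a new consumer joined the group.
                        System.out.println("Revoked: " + partitions);
                    }
                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // This instance's share of the partitions after the rebalance.
                        System.out.println("Assigned: " + partitions);
                    }
                });
                while (true) {
                    consumer.poll(Duration.ofMillis(500));
                }
            }
        }
    }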

Predictable Performance

Apache Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds. It has the ability to provide the desired responsiveness in a timely manner. To offer predictable performance, Kafka builds on the following ideas:

(1) Implement a distributed architecture and take a partitioned approach.

(2) Implement intelligent load balancing to redirect user requests to their respective services and then to their database partitions.

(3) Design our system to ensure that users are served from their nearest data center with the lowest network delay.

(4) Maintain an in-memory cache of the active users to boost performance.

The key idea of the above approach is that user A is always served from database partition X, while partition X serves a finite number of users and another partition Y serves a different set of users. The overall workload is shared by different partitions, so Kafka offers predictable performance for write operations because each user always interacts with the same partition.
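Kafka's producer applies the same idea through record keys: the default partitioner hashes the key, so all events for the same user land on the same partition. A minimal sketch, with placeholder topic, key, and broker names:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class KeyedPartitioningExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // The default partitioner hashes the record key, so every event for
                // "user-A" lands on the same partition and is served by the same leader.
                for (int i = 0; i < 3; i++) {
                    RecordMetadata meta = producer.send(
                            new ProducerRecord<>("user-activity", "user-A", "event-" + i)).get();
                    System.out.printf("user-A -> partition %d, offset %d%n",
                            meta.partition(), meta.offset());
                }
            }
        }
    }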

Secure

Kafka Security has three components:

(1) Encryption of data in-flight using SSL/TLS: this allows our data to be encrypted between our producers and Kafka, and between our consumers and Kafka (a minimal client configuration is sketched after this list).

(2) Authentication using SSL or SASL: this allows our producers and consumers to authenticate to our Kafka cluster, which verifies their identity. Kafka also supports authentication of connections from brokers to ZooKeeper.

(3) Authorization using ACLs: once our clients are authenticated, our Kafka brokers can check them against access control lists (ACLs) to determine whether or not a particular client is authorized to write to or read from a given topic.
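Putting the three components together on the client side, here is a hedged sketch of a producer configured for SASL over TLS. The broker host, truststore path, credentials, and topic name are all placeholders, and SASL/PLAIN is just one of several supported mechanisms; the broker must be configured with a matching listener and ACLs.

    import org.apache.kafka.clients.CommonClientConfigs;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.config.SaslConfigs;
    import org.apache.kafka.common.config.SslConfigs;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class SecureProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1.example.com:9093");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // (1) Encryption in flight plus (2) authentication: SASL over TLS.
            props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
            props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
            props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "truststore-password");

            // SASL/PLAIN credentials; the broker verifies them and, once authenticated,
            // (3) its ACLs decide whether this principal may write to the topic.
            props.put(SaslConfigs.SASL_MECHANISM, "PLAIN");
            props.put(SaslConfigs.SASL_JAAS_CONFIG,
                    "org.apache.kafka.common.security.plain.PlainLoginModule required "
                            + "username=\"orders-service\" password=\"secret\";");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("orders", "order-1", "created"));
            }
        }
    }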

Basically, Apache Kafka plays the role of an internal middle layer that enables our back-end systems to share real-time data feeds with each other through Kafka topics. With a standard Kafka setup, any user or application can write messages to any topic, as well as read data from any topic. However, implementing Kafka security becomes a requirement when we move towards a shared tenancy model in which multiple teams and applications use the same Kafka cluster, or when the cluster starts onboarding critical and confidential information.

This is the end of this blog.

Thank you!

Did you find this blog post helpful? If so, could you help me out by sharing it?
