What is Apache Kafka?

Apache Kafka is an open-source, distributed publish/subscribe messaging platform purpose-built to handle real-time streaming data, supporting distributed streaming, pipelining, and replay of data feeds for fast, scalable operations.

Kafka is a broker-based solution that operates by maintaining streams of data as records within a cluster of servers. Kafka servers can span multiple data centers and provide data persistence by storing streams of records (messages) across multiple server instances in topics. A topic stores records or messages as a series of immutable tuples, each consisting of a key, a value, and a timestamp.
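
To make this concrete, here is a minimal sketch of publishing and reading back such a record using the third-party kafka-python client; the broker address (localhost:9092), topic name, and choice of client library are assumptions for illustration, not part of Kafka itself:

```python
from kafka import KafkaProducer, KafkaConsumer

# Assumes a Kafka broker listening on localhost:9092.
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# A record is a key/value pair; the broker assigns the timestamp
# (or the producer can supply one via timestamp_ms).
future = producer.send("demo-topic", key=b"sensor-1", value=b"42.0")
metadata = future.get(timeout=10)  # block until the broker acknowledges
producer.flush()

consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating when no new records arrive
)
for record in consumer:
    # Each consumed record exposes the key, value, and timestamp.
    print(record.key, record.value, record.timestamp)
```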

Use cases for Apache Kafka

Apache Kafka is one of the fastest-growing open source messaging solutions on the market today, largely because its architectural design provides a superior logging mechanism for distributed systems.

Being purpose-built for real-time log streaming, Apache Kafka is ideally suited for applications that need:

  • Reliable data exchange between disparate components
  • The ability to partition messaging workloads as application requirements change
  • Real-time streaming for data processing
  • Native support for data/message replay

Concepts of Apache Kafka

Topics: A topic is a fairly universal concept in publish/subscribe messaging. In Apache Kafka and other messaging solutions, a topic is an addressable abstraction used to show interest in a given data stream (series of records/messages): applications publish records to a topic and subscribe to it to consume them.
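
As a minimal sketch, a topic can be created administratively before applications publish to it (again assuming a local broker and the kafka-python admin client; brokers can also be configured to auto-create topics):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create a topic named "orders"; the name and counts here are
# illustrative assumptions.
admin.create_topics([
    NewTopic(name="orders", num_partitions=1, replication_factor=1)
])
admin.close()
```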

Partitions: In Apache Kafka, topics can be subdivided into a series of ordered queues called partitions. Each partition forms a sequential commit log that is continually appended to. In the Kafka system, each record/message is assigned a sequential ID called an offset that identifies the message or record within the given partition.
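
Because offsets are stable, sequential IDs within a partition, a consumer can address a specific position in the log directly. A minimal sketch with kafka-python, where the topic name, partition number, and offset are assumed for illustration:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    consumer_timeout_ms=5000,
)

# Manually attach to partition 0 of the "orders" topic...
tp = TopicPartition("orders", 0)
consumer.assign([tp])

# ...and jump straight to offset 42 within that partition's commit log.
consumer.seek(tp, 42)
for record in consumer:
    print(record.offset, record.value)  # offsets increase sequentially
    break
```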

Persistence: Apache Kafka operates by maintaining a cluster of servers that durably persist records/messages as they are published. The Kafka cluster uses a configurable retention timeout to determine how long a given record is persisted, regardless of consumption. While a record/message is within the retention timeout, it is available for consumption; once it exceeds the retention timeout, it is deleted and its space is freed.
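
Retention can be set per topic. As an illustrative sketch, Kafka's topic-level retention.ms configuration can be supplied at creation time; the seven-day value and topic name below are assumptions:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Keep records for 7 days (in milliseconds), regardless of consumption.
admin.create_topics([
    NewTopic(
        name="audit-log",
        num_partitions=1,
        replication_factor=1,
        topic_configs={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
    )
])
admin.close()
```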

Topic/Partition Scaling: Because Apache Kafka operates as a cluster of servers, topics/partitions can be scaled by spreading the load for a given topic/partition across the servers. This load sharing allows each server in the Kafka cluster to handle the distribution and persistence of records/messages for a given topic/partition, while the data is replicated across servers to provide fault tolerance and high availability in the event a server fails. Partitions are divided among the servers, with one server elected as the partition leader and all other servers acting as followers: the leader handles all distribution and persistence (reads/writes) of data for the partition, and the followers provide replication for fault tolerance.
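
In practice, this behavior is driven by two numbers chosen when a topic is created: the partition count (how the load is spread) and the replication factor (how many servers keep a copy of each partition, one of which is elected leader). A sketch assuming a cluster of at least three brokers, with addresses and names invented for illustration:

```python
from kafka.admin import KafkaAdminClient, NewTopic

# Assumes a cluster of at least three brokers.
admin = KafkaAdminClient(
    bootstrap_servers=["broker1:9092", "broker2:9092", "broker3:9092"]
)

# Six partitions spread the load; replication_factor=3 means each
# partition has one elected leader and two followers for fault tolerance.
admin.create_topics([
    NewTopic(name="events", num_partitions=6, replication_factor=3)
])
admin.close()
```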

Producers: In Apache Kafka, the concept of a producer is no different than in most messaging systems. A producer of data (records/messages) defines what topic (stream of data) a given record/message should be published on. Since partitions are used to provide additional scalability, a producer can also define what partition a given record/message is published to. Producers do not have to specify a partition; when no partition is defined, a round-robin style of load balancing across topic partitions can be achieved.
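
A sketch of the options with kafka-python (topic name and partition number are illustrative). Note that when a key is supplied instead, Kafka's default partitioner hashes it, so records sharing a key land on the same partition:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Explicitly target partition 2 (assumes the topic has at least 3 partitions).
producer.send("events", value=b"explicit routing", partition=2)

# Omit the partition (and key): the client spreads records across
# partitions for load balancing.
producer.send("events", value=b"balanced routing")

# Supplying a key pins all records with that key to one partition,
# preserving their relative order.
producer.send("events", key=b"device-7", value=b"keyed routing")
producer.flush()
```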

Consumers: Consumers in Kafka, like in most messaging systems, are the entities that process records/messages. Consumers can be configured to work independently on individual workloads or cooperatively with other consumers on a given workload (load balancing). Consumers manage how they process a workload based on their consumer group name, which allows consumers to be distributed within a single process, across multiple processes, and even across multiple systems. Using consumer group names, consumers can either load balance record/message consumption across the consumer set (multiple consumers with the same consumer group name), or process each record/message uniquely (multiple consumers with unique consumer group names), in which case every consumer subscribed to a topic/partition receives the message for processing.
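
The distinction comes down to the group_id setting, as in this sketch (names are illustrative): start several copies of the first consumer and the topic's partitions are balanced among them, while a consumer with its own unique group name receives every record independently.

```python
from kafka import KafkaConsumer

# Load balancing: every consumer started with THIS group_id shares
# the topic's partitions, so each record is processed once per group.
worker = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="billing-workers",
)

# Fan-out: a consumer with a UNIQUE group_id gets its own copy of
# every record, independent of the group above.
auditor = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="audit-service",
)
```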

Business benefits of Apache Kafka

Apache Kafka has some significant benefits due to the design goals it was built to satisfy. Apache Kafka was, and is, designed with three main requirements in mind:

  • Provide a publish/subscribe messaging model for data distribution and consumption
  • Allow for long-term storage of data that can be accessed and replayed over time
  • Support the ability to access data in real-time to provide real-time stream processing

The first requirement is where Apache Kafka truly shines. Unlike some messaging systems, it doesn't provide all the bells and whistles of transactionality or multiple distribution models; it focuses on providing data distribution for a publish/subscribe model that supports stream processing.

Secondly, since it is designed from the ground up to provide long-term data storage and replay of data, Apache Kafka is able to approach data persistence, fault tolerance, and replay uniquely. This is seen in how Apache Kafka handles persistence through data replication in the cluster, scalability through spreading data across partitions for increasing data volumes and load, and data access through topics/partitions, data offsets, and consumer group names.

Lastly, because Apache Kafka was originally designed to act as the communications layer for real-time log processing, it lends itself naturally to real-time stream processing applications. This makes Apache Kafka ideally suited for applications that leverage a communications infrastructure that can distribute high volumes of data in real time.

Seamless Messaging and Streaming Functionality: When dealing with large volumes of data, messaging can provide a significant advantage to communications and scalability compared to legacy communications models. By melding messaging and streaming functionality, Apache Kafka provides a unique ability to publish, subscribe, store, and process records in real time.

Time-based Data Retention for Data Replay: Apache Kafka’s ability to natively store data persistently to disk across a cluster allows for a simple approach to fault tolerance. When coupled with the ability to recall stored data based on time-based retention periods, and access the data based on sequential offsets, Apache Kafka offers a robust approach to data storage and retrieval in a cluster setup.
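
Replay, for example, amounts to rewinding a consumer's offsets to any point still inside the retention window; a minimal sketch with kafka-python, with names assumed as before:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    consumer_timeout_ms=5000,
)
tp = TopicPartition("events", 0)
consumer.assign([tp])

# Rewind to the oldest record still within the retention period;
# everything from there forward is replayed in order.
consumer.seek_to_beginning(tp)
for record in consumer:
    print(record.offset, record.value)
```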

Foundational Approach for Stream Processing: Being able to move data fast and efficiently is the key to interconnectivity. Apache Kafka provides the foundation to move data seamlessly either as records, messages, or streams. Before you can inspect, transform, and leverage data, you need the ability to move it from place to place in real time, and Apache Kafka provides a native approach for moving and storing data in real time.

Native Integration Support: One-size-fits-all is never a good approach, and Apache Kafka natively provides the ability to expand and grow through integration points in the form of the Connector API. Using the Apache Kafka Connector API, applications can integrate with third-party solutions, other messaging systems, and legacy applications, either through pre-built connectors and open source tools or through purpose-built connectors, depending on application needs.
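
Connectors are typically deployed to a Kafka Connect worker through its REST interface. As a hedged sketch, this assumes a Connect worker listening at localhost:8083 and uses the FileStreamSource connector that ships with Kafka as a simple example; the file path and names are illustrative:

```python
import requests

# Assumes a Kafka Connect worker running at localhost:8083.
connector = {
    "name": "file-source-demo",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",  # lines of this file become records...
        "topic": "connect-demo",   # ...published to this topic
    },
}
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
```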

Putting Apache Kafka into action

Apache Kafka can be used for numerous application types to power an event-driven architecture, from real-time message distribution to streaming events. Take, for example, a manufacturing company that is moving to an event-driven architecture to provide more automation, more tracking, and faster delivery of products and services. While many components are needed to supply the end-to-end event-driven architecture, communication is the foundation of how events are streamed and processed, and Apache Kafka provides the ideal solution for enabling data distribution and streaming in a lightweight, distributed fashion.

With Apache Kafka, records/messages about the process can be streamed to one or multiple applications, driving additional workloads as the product moves through the manufacturing process. In the same way, notifications can be generated and tracked as the product is processed through the manufacturing line, and anomalies can be caught and handled in real time, compared to a manual process that only caught exceptions at the end. Lastly, an enhanced customer experience can be delivered by providing detailed insight into where the product is in the manufacturing process and how it is being assembled, tested, and delivered.
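
As an illustrative sketch of the pattern (the topic and event fields here are invented for the example), each station on the line could publish a JSON event that any number of downstream applications consume independently through their own consumer groups:

```python
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

# Each station publishes an event as the product moves through the line;
# tracking, anomaly detection, and customer-facing apps each subscribe
# with their own consumer group and react in real time.
producer.send("production-line", value={
    "unit_id": "A-1042",
    "station": "assembly",
    "status": "passed",
})
producer.flush()
```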

At the core of all these improvements is Apache Kafka, supplying the nervous system for data distribution and communication while also enabling faster time to market and reduced costs.

[Figure: Kafka cluster diagram]