Introduction to Apache Kafka – Swiss Army Knife of Just about Everything High on the Hype Cycle

Roman Kosaminsky at ERNI Switzerland offices.

Introduction

You may have heard about a middleware called Apache Kafka. By the project's own claim, Apache Kafka is used by more than 80% of Fortune 100 companies.[1] In my view it is arguably one of the more important pieces of software to have come into existence in the last decade. Let me tell you why I think so.

What Is Kafka and How Does It Work?

Apache Kafka is an event streaming platform. It is used to publish and subscribe to, store, and process streams of events. It works like a big, distributed, highly scalable, elastic, fault-tolerant, and secure event pump with storage capabilities[2], and it performs these tasks at very high speed[3].

Speaking in architectural terms, an Event can be considered “a significant change in state”[4]; in our case, a change in the state of any functional process or other aspect of an IT system. For two IT systems to communicate, the information to be communicated has to be packaged into a Message, which is “a data record that the messaging system can transmit through a message channel”[5]. Connecting two systems so that they send Messages back and forth is what we call System Integration. For the sake of this article – even though it is not scientifically exact – the three terms Event, Message, and Data are considered to be synonymous.

All the information that flows between all kinds of systems on various levels manifests as Messages that are transmitted via a Message Channel. This could mean data flowing between single microservices, REST API calls, log entries, interface calls of pre-packaged software (e.g. SAP, Microsoft Dynamics, …) and even function calls. From a conceptual perspective Apache Kafka establishes this Message Channel.

Kafka organizes events through Topics. A Topic can be imagined as a public announcement system, where (zero, one, or multiple) Producers can announce events related to the same subject. (Zero, one, or multiple) Consumers on the receiving end listen to events announced for that particular subject. They “subscribe” to the Topic. This communication pattern is known as “Publish-Subscribe”. It is effectively a message broadcast, where any number of Producers “publish” messages related to a specific Topic and any number of Consumers listen to the messages they have an interest in. The Topic is a “Publish-Subscribe” Message Channel.
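To make the pattern a bit more tangible, here is a minimal in-memory sketch in Python. It is not Kafka's API: the class `InMemoryBus`, the topic names, and the push-style handlers are all made up for illustration (real Kafka Consumers poll the broker instead of being called back). It only shows that several Producers and several Consumers can share one Topic without knowing each other.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InMemoryBus:
    """Toy publish-subscribe channel (hypothetical, not Kafka's API)."""

    def __init__(self) -> None:
        # topic name -> handlers of all Consumers subscribed to that topic
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: str) -> int:
        """Broadcast the event to every subscriber; return how many received it."""
        handlers = self._subscribers[topic]
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryBus()
billing_inbox: List[str] = []
shipping_inbox: List[str] = []

# Two independent Consumers subscribe to the same Topic.
bus.subscribe("orders", billing_inbox.append)
bus.subscribe("orders", shipping_inbox.append)

# Two independent Producers publish to it; neither knows who is listening.
bus.publish("orders", "order 1 created")
bus.publish("orders", "order 2 created")

# A Topic with zero Consumers is perfectly valid as well.
delivered = bus.publish("payments", "payment received")
```

After running this, both inboxes hold both order events, and `delivered` is 0 because nobody subscribed to the "payments" Topic.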

A very useful aspect of a Topic is that it is actually an ordered sequence of Records, i.e. key-value pairs, each with a particular Offset. If a Consumer “consumes” a message, it is not deleted from the Topic. The whole thing basically works like a database transaction log. How long messages are kept in your Topic can be configured by defining a retention time. You can set the retention time to infinity, and all the messages ever published via a Topic are kept forever (or as long as there is storage capacity left). A Consumer can not only ask the Topic for the latest message; it can also ask for any message still retained in the Topic using its Offset. The Topic acts like a (very basic) database.
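The log idea can be sketched in a few lines of Python. Again, this is an illustration under my own assumptions and not Kafka code: `TopicLog`, `Record`, `read_from`, and `expire` are hypothetical names, and timestamps are passed in by hand to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    offset: int
    timestamp: float
    key: str
    value: str


class TopicLog:
    """Toy append-only log with offsets and time-based retention (hypothetical)."""

    def __init__(self, retention_seconds: Optional[float] = None) -> None:
        self.retention_seconds = retention_seconds  # None means "keep forever"
        self._records: List[Record] = []
        self._next_offset = 0

    def append(self, key: str, value: str, now: float) -> int:
        record = Record(self._next_offset, now, key, value)
        self._records.append(record)
        self._next_offset += 1
        return record.offset

    def expire(self, now: float) -> None:
        """Drop records older than the retention time; offsets are never reused."""
        if self.retention_seconds is None:
            return
        cutoff = now - self.retention_seconds
        self._records = [r for r in self._records if r.timestamp >= cutoff]

    def read_from(self, offset: int) -> List[Record]:
        """Return all retained records at or after the given offset."""
        return [r for r in self._records if r.offset >= offset]


log = TopicLog(retention_seconds=60)
log.append("order-1", "created", now=0.0)
log.append("order-1", "paid", now=30.0)
log.append("order-2", "created", now=90.0)

first_read = [r.value for r in log.read_from(0)]
second_read = [r.value for r in log.read_from(0)]  # reading does not delete anything

log.expire(now=100.0)  # records older than 60 seconds are dropped
after_expiry = [r.offset for r in log.read_from(0)]
```

Both reads return the same three values, and after expiry only the record with Offset 2 is left. The Offsets of the surviving records do not change.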

Scalability is achieved through the use of Partitions. Every topic contains messages with a key and a value, e.g. topic “orders” contains messages with “orderId” as the key and a JSON document describing the order as the value. In terms of runtime and storage, every topic is split into Partitions that are spread across multiple Broker nodes, and each Partition is managed by a specific node. The Partition a message is written to is derived from its message key (by default from a hash of the key), so all messages with the same key end up in the same Partition. In the context of our “orders” topic this could mean that node A manages the Partition holding all messages for “orderId” 1 and node B manages the Partition holding all messages for “orderId” 2. The ordering of messages and their Offsets apply per Partition. Thus Partitions enable parallel reads/writes across multiple nodes. To scale you just have to add nodes. That is what we call horizontal scaling. This describes the “distributed” aspect of Apache Kafka as well, as topics and their partitions are spread across a cluster of nodes.[6]
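A rough sketch of key-based partitioning in Python may help here. Please note the assumptions: Kafka's default partitioner uses a different hash function (murmur2), whereas I use `zlib.crc32` from the standard library as a stand-in, and the number of partitions and the keys are invented for the example.

```python
import zlib
from collections import defaultdict
from typing import DefaultDict, List, Tuple

NUM_PARTITIONS = 3  # assumed partition count of the "orders" topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition (crc32 is a stand-in for Kafka's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


partitions: DefaultDict[int, List[Tuple[str, str]]] = defaultdict(list)
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-3", "created"),
    ("order-1", "shipped"),
]

for key, value in events:
    partitions[partition_for(key)].append((key, value))

# All events of one order live in one partition, in the order they were written.
order_1_partition = partition_for("order-1")
order_1_events = [v for k, v in partitions[order_1_partition] if k == "order-1"]
```

The point of hashing the key is that the same key always maps to the same Partition, so the history of a single order stays in sequence while different orders can be handled by different nodes in parallel.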

Apache Kafka is fault-tolerant and prepared for high availability through replication. Replication basically creates a configurable number of copies of your topics. These copies can be established locally, across a cluster, across a data centre, or even across different geographic regions. To avoid message loss and to enable maintenance and update cycles, it is widely considered best practice to set the replication factor to at least 3, so that there are always three copies of your topics available.

With its support of TLS, Kerberos, strong encryption, UNIX-like access control to data, and authentication mechanisms for communication between Kafka nodes and the management node, Apache Kafka provides a high level of security in addition to all aforementioned capabilities.

Thus Apache Kafka as an event streaming platform with storage capabilities essentially works like the distributed, highly scalable (…) spinal cord of your IT system (or system of systems). It can connect everything with everything, has a “memory”, and is incredibly powerful and simple at the same time.

An Architect’s Perspective on Kafka

The unique combination of its capabilities along with its simplicity, horizontal scalability, and high throughput makes Apache Kafka a powerful, versatile tool to be used at the core of various architectural patterns and approaches, particularly in the context of modern Event-Driven Architectures, Event-Driven SOA (aka SOA 2.0), and others. As Apache Kafka allows for Producers and Consumers to be established through custom development, it can integrate cloud-based containerized Microservices and APIs just like it can integrate monolithic structures or mainframe systems. It can thus easily create an integrated enterprise application landscape or enterprise architecture… or help with the simplest integration use case on the lowest level of a single product.

Log Aggregation

Log Aggregation is an aspect of application and infrastructure monitoring and relates to the collection of log-structured events emitted by software and hardware systems. Monitored log-structured events can occur at rates high enough to overwhelm traditional databases. Apache Kafka can be used as a buffer that collates the events before they are written into a read-optimised database. In the context of Log Aggregation this is similar to the role Logstash plays in the ELK stack. Generally speaking, Apache Kafka can be used as a scalable, persistent message buffer.
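The buffering idea boils down to the following Python sketch. It is only an in-memory illustration: `SlowStore` and `drain` are hypothetical names, the deque stands in for the topic, and unlike a Kafka topic the deque forgets an event as soon as it has been handed over.

```python
from collections import deque
from typing import Deque, List


class SlowStore:
    """Hypothetical read-optimised store that prefers a few large bulk writes."""

    def __init__(self) -> None:
        self.batches: List[List[str]] = []

    def bulk_insert(self, batch: List[str]) -> None:
        self.batches.append(list(batch))


def drain(buffer: Deque[str], store: SlowStore, batch_size: int) -> int:
    """Move buffered events into the store in batches; return the number written."""
    written = 0
    while buffer:
        batch = [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]
        store.bulk_insert(batch)
        written += len(batch)
    return written


# A burst of log events arrives faster than the store could take single writes.
buffer: Deque[str] = deque()
for i in range(10):
    buffer.append(f"host-1 request {i}")

store = SlowStore()
written = drain(buffer, store, batch_size=4)
```

The ten events reach the store as three bulk writes of 4, 4, and 2 events. The emitting systems never have to wait for the database.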

Log Shipping

Traditional database management systems keep a history of executed actions called the Transaction Log. The entirety of the Transaction Log represents the state of a database. Log Shipping is the process of copying Transaction Log entries from a primary system to one or more replicas. This way a full backup of the primary database is established with every replica to increase database availability. Kafka topics work like transaction logs. The Publish-Subscribe pattern established by Kafka fans out the entries of the primary database's transaction log to multiple replicas.
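How replaying a log rebuilds state can be sketched as follows in Python. The log entry format, the `Replica` class, and the `apply` function are assumptions made for this example and do not correspond to any particular database or to Kafka's API.

```python
from typing import Dict, List, Tuple

LogEntry = Tuple[str, str, int]  # (operation, key, value), format assumed here


def apply(state: Dict[str, int], entry: LogEntry) -> None:
    """Apply one transaction log entry to a key-value state."""
    operation, key, value = entry
    if operation == "set":
        state[key] = value
    elif operation == "delete":
        state.pop(key, None)


class Replica:
    """Toy replica that remembers how far it has replayed the log."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}
        self.next_offset = 0

    def catch_up(self, log: List[LogEntry]) -> int:
        """Replay every entry not yet seen; return how many were applied."""
        pending = log[self.next_offset:]
        for entry in pending:
            apply(self.state, entry)
        self.next_offset = len(log)
        return len(pending)


transaction_log: List[LogEntry] = [
    ("set", "stock:apples", 10),
    ("set", "stock:pears", 4),
    ("set", "stock:apples", 7),
    ("delete", "stock:pears", 0),
]

primary_state: Dict[str, int] = {}
for entry in transaction_log:
    apply(primary_state, entry)

replica_a, replica_b = Replica(), Replica()
replica_a.catch_up(transaction_log[:2])  # replica A first sees only half the log
replica_a.catch_up(transaction_log)      # later it resumes from its own offset
replica_b.catch_up(transaction_log)      # replica B replays everything at once
```

Both replicas end up with exactly the state of the primary, no matter whether they followed the log continuously or caught up later. This is where the Offset and the retention of a topic pay off.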

Publish-Subscribe

As described, the Publish-Subscribe messaging pattern establishes a message broadcast where message producers and message consumers are basically unaware of each other. Consumers are concerned with specific content categories – i.e. topics – fed by Producers. This pattern is used in all kinds of integration scenarios from integrating a full enterprise application landscape to integrating loosely-coupled microservices of a single system.

Staged Event-Driven Architecture (SEDA)

The Staged Event-Driven Architecture[7] approach breaks down an event-driven application into connected processing stages (i.e. services in a cloud environment). These stages are connected by queues/topics. This way a unidirectional event cascade is established, where each stage can scale according to its respective load. SEDA increases the modularity and improves decoupling of services across a system. SEDA as an architectural pattern is used by e.g. Apache Cassandra and other Big Data applications.
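A very small staged pipeline can be sketched with Python's standard `queue` and `threading` modules. The stage functions, the queue names, and the `STOP` marker are my own illustrative choices; in a Kafka-based system the queues would be topics and the stages would be separate services.

```python
import queue
import threading
from typing import Any, Callable, List

STOP = object()  # marker telling a stage that no more events will arrive


def stage(worker: Callable[[Any], Any], inbox: queue.Queue,
          outbox: queue.Queue) -> threading.Thread:
    """Run one processing stage in its own thread: read, transform, pass on."""
    def run() -> None:
        while True:
            event = inbox.get()
            if event is STOP:
                outbox.put(STOP)  # propagate shutdown to the next stage
                return
            outbox.put(worker(event))

    thread = threading.Thread(target=run)
    thread.start()
    return thread


raw: queue.Queue = queue.Queue()
parsed: queue.Queue = queue.Queue()
enriched: queue.Queue = queue.Queue()

threads = [
    # Stage 1: split a raw CSV line into fields.
    stage(lambda line: line.split(","), raw, parsed),
    # Stage 2: turn the fields into a typed order record.
    stage(lambda f: {"order_id": int(f[0]), "amount": float(f[1])}, parsed, enriched),
]

for line in ["1,9.50", "2,20.00"]:
    raw.put(line)
raw.put(STOP)

for thread in threads:
    thread.join()

results: List[Any] = []
while True:
    item = enriched.get()
    if item is STOP:
        break
    results.append(item)
```

Each stage only knows its inbox and its outbox. If one stage becomes the bottleneck, you would start more workers for that stage alone, at the price of losing the strict ordering this single-worker sketch still has.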

Complex Event Processing (CEP)

Complex Event Processing (CEP) is a set of concepts to identify patterns and extract information from event streams to help with e.g. fraud detection or security analysis.
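As a simple illustration of pattern detection on an event stream, the following Python sketch raises an alert when a user produces several failed logins within a sliding time window. The event format, the threshold of 3, the 60-second window, and the function name `detect_bursts` are assumptions for this example, and the events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, Iterable, List, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, user, outcome)


def detect_bursts(events: Iterable[Event], threshold: int = 3,
                  window_seconds: float = 60.0) -> List[Tuple[str, float]]:
    """Return (user, timestamp) whenever a user reaches `threshold` failed
    logins within the last `window_seconds`."""
    recent: DefaultDict[str, Deque[float]] = defaultdict(deque)
    alerts: List[Tuple[str, float]] = []
    for timestamp, user, outcome in events:
        if outcome != "failed":
            continue
        window = recent[user]
        window.append(timestamp)
        # Forget failures that have fallen out of the time window.
        while window and timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) >= threshold:
            alerts.append((user, timestamp))
    return alerts


events: List[Event] = [
    (0.0, "alice", "failed"),
    (10.0, "bob", "failed"),
    (20.0, "alice", "failed"),
    (25.0, "alice", "ok"),
    (50.0, "alice", "failed"),
    (200.0, "bob", "failed"),
    (210.0, "bob", "failed"),
]

alerts = detect_bursts(events)
```

Here only alice triggers an alert, at second 50, because her three failures fall within one minute. Bob's failures are too far apart to form the pattern.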

Kafka and Everything High on the Hype Cycle

Across the IT industry a broad variety of technologies and concepts are marketed with a multitude of buzzwords. The more up-to-date buzzwords include Internet of Behaviors, Distributed Cloud (including Internet of Things Edge Cloud), and Event-Driven Architecture. Some more buzzwords that are not as up-to-date but still hot would be Artificial Intelligence, Smart Industry 4.0, Big Data, and Analytics.

Besides the fact that some of these technologies are quite the opposite of new, all of them have one essential thing in common: at their core they need (streams of) data to be moved from source to sink, from Producer to Consumer, with high throughput and in a scalable, reliable, versatile, and ultimately balanced way in order to integrate systems with each other. You can deploy huge numbers of sensors and actuators for your IoT use case or have huge machine learning models in place. If the data is not streamed from your sensors and actuators or fed into your models, there is no use case. A huge enterprise application landscape does not provide any benefit and is not even able to establish digital processes as long as the applications cannot be integrated with each other in a meaningful way.

With its capabilities Apache Kafka does just that: it is the middleware that lets data flow through your system. It integrates your IoT and Edge Devices with an AI backbone to enable Predictive Maintenance. It integrates your mainframe system’s RDBMS with modern cloud-based applications. It is virtually the Swiss Army Knife of just about everything high on the hype cycle.

About Roman Kosaminsky

Roman has been a tech-savvy digital native since early childhood (he started out on an Apple II in 1980). He professionally joined the IT Consulting business in 1998 and supported the Greater Good as Developer, Architect, Project Manager, and Consultant, gravitating towards Enterprise, Integration, and Cloud Architecture. Roman is one of the chosen few who joined The Dark Side (Sourcing Advisory) and successfully made it back into the light. As a principal consultant, Roman is ERNI Switzerland’s Lead Enterprise Architect and leads its Architecture Advisory Service Line. When he has time to spare, he is a man of the great outdoors (cycling, hiking), a music lover and concert-goer, and an avid chess player.

Footnotes

[1] see the Apache Kafka Homepage (Anonymous 2021) via https://kafka.apache.org/

[2] see https://kafka.apache.org/intro

[3] see https://www.confluent.io/blog/kafka-fastest-messaging-system/

[4] see (Chandy 2006)

[5] see (Hohpe and Woolf 2003)

[6] If you want to delve into all the technical details of how Apache Kafka is doing what it is doing, a lot has been written about topics, partitions, records, keys, values, offsets, consumers, producers. I would recommend reading https://kafka.apache.org/intro as a starting point.

[7] see https://www.mdw.la/papers/seda-sosp01.pdf

References

Anonymous. 2021. “Apache Kafka.” https://kafka.apache.org/.

Chandy, Mani K. 2006. “Event-Driven Applications: Costs, Benefits and Design Approaches.” Gartner Application Integration & Web Services Summit 2006.

Hohpe, Gregor, and Bobby Woolf. 2003. Enterprise Integration Patterns: Designing, Building, and Deploying Messaging Solutions. Addison-Wesley Professional.
