
Closely Watched Kafka. Monitoring Apache Kafka in Business Process Management

"Somebody must have made a false accusation against Joseph K., for he was arrested one morning without having done anything wrong." [1] This is one of the most famous opening sentences in literature. It comes from the dystopian and absurd world of "The Trial" by Franz Kafka. In Kafka’s works, the sense of danger stems from unclear threats and a lack of transparent rules. In another classic dystopia, George Orwell’s "1984", the oppressive nature of the world is created through constant surveillance of citizens.

Interestingly, what would be unbearable in human society is absolutely necessary in IT systems. Monitoring enables them to operate efficiently, reliably, and securely.

In this article, we will show how the Flowee development team uses Apache Kafka, which runs in the background of our platform. Flowee is our proprietary BPMN platform that allows organizations to easily design, manage, and execute business processes. We will also demonstrate how Kafka supports the platform from a technical perspective and how Kafka clusters can be effectively monitored.  [2]

If you want to try out the monitoring methods yourself, visit GitHub: we have a repository with examples.

Flowee is a platform designed to minimise the need to write code when deploying new processes for a client. It achieves this through advanced graphical editors for processes and forms. If you want to learn more about it, visit THE WEBSITE.

1. What is Apache Kafka?

Apache Kafka is a distributed messaging system designed to handle large volumes of data in real time. Its primary purpose is to enable the creation of systems that can scale easily while supporting asynchronous data processing close to real time.
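Conceptually, Kafka's core data structure is an append-only, partitioned log: producers append keyed messages, each message receives a monotonically increasing offset within its partition, and consumers track their own read position. The following toy sketch (plain Python, no Kafka involved) is purely illustrative of that model, not of how Kafka itself is implemented:

```python
class MiniLog:
    """Toy model of Kafka's core abstraction: an append-only,
    partitioned log where each message gets a monotonically
    increasing offset within its partition."""

    def __init__(self, partitions=3):
        self.partitions = [[] for _ in range(partitions)]

    def produce(self, key, value):
        # Kafka routes a keyed message to a partition by hashing the key,
        # which keeps all messages with the same key in order.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers read sequentially from an offset they track themselves;
        # the log is not modified by reading.
        return self.partitions[partition][offset:]

log = MiniLog()
p, off = log.produce("order-42", "created")
log.produce("order-42", "paid")
print(log.consume(p, off))  # both events for the key, in produce order
```

Two properties of this model carry over to real Kafka: ordering is guaranteed only within a partition, and reading does not delete messages, which is what lets many independent consumers process the same stream.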

Kafka was originally developed by LinkedIn and is now an open-source project maintained by the Apache Software Foundation. It is widely used for building scalable data streaming applications, log aggregation systems, telemetry platforms, and data integration pipelines. Kafka is also commonly used in microservice architectures.

The reliability and scalability of Kafka are best illustrated by LinkedIn’s own infrastructure. It's been reported that its infrastructure handles seven trillion messages per day! For those lost in the zeros – that’s seven million million messages. Quite impressive!

2. Apache Kafka in Flowee

Flowee is a distributed system composed of multiple microservices. Each microservice is responsible for a small, clearly defined part of the overall system logic. As in any microservice architecture, ensuring reliable and efficient communication between system components is crucial. For this reason, when designing Flowee we selected Apache Kafka as the backbone communication system.

3. Why is monitoring the Kafka cluster crucial?

Monitoring Apache Kafka in a production environment is crucial for many reasons, both in terms of ensuring the reliability and performance of the system and minimising downtime in case of problems. Among them, we can distinguish the following issues:

Monitoring helps understand how Kafka resources are used and where potential bottlenecks may appear. This enables performance optimization and more efficient resource management.

Early detection of problems, such as delays in processing, Kafka broker overload, or network issues, is crucial for maintaining the continuity of the system's operation. Rapid identification and resolution of problems minimises negative impact on business operations. 

Monitoring allows an understanding of usage patterns and system throughput, which is essential for scaling planning. This information helps make decisions about expanding or modifying infrastructure to handle increased load.

Monitoring can help detect unusual patterns of behavior that may indicate security issues (such as DDoS attacks or unauthorized access attempts). 

Many organizations must meet strict Service Level Agreement (SLA) requirements for system availability and performance. Monitoring allows teams to verify whether Kafka services and dependent applications meet those standards.

In production environments processing large volumes of data, it is essential for the data flow through Kafka to be as efficient as possible. Monitoring allows for identifying and eliminating bottlenecks in data flow.

Monitoring Kafka in a production environment is therefore not only a matter of maintaining infrastructure efficiency, but also a strategic element of managing the performance, security, and operational costs of systems based on this technology.

4. How do we monitor Kafka in Flowee?

Monitoring in Flowee is realised with two tools: Prometheus and Grafana. The former collects current measurements from defined sources, such as individual microservices or Kafka brokers; Grafana then visualises the measurements gathered by Prometheus as graphs.

Here is a repository with a ready-to-run docker-compose configuration demonstrating the setups described in this article.

After cloning the repository, run the sample environment with the command:

    docker compose -f compose.yaml up -d

Once the containers are up, two panels are available in the browser:

Grafana: http://localhost:3000/dashboards 

Kafka UI: http://localhost:8080 (login and password admin:admin)

4.1. Monitoring Kafka brokers

Kafka cluster components are monitored using the jmx_exporter library (https://github.com/prometheus/jmx_exporter). This allows individual processes to expose current measurements of monitored parameters via an endpoint on port 8888. The measured values are then periodically retrieved by the Prometheus server. 

Examples of parameters we can monitor with jmx_exporter include: the number of active brokers and partitions, under-replicated partitions, the number of messages per second, and CPU load. We also track the amount of data transferred between brokers and stored on disk, statistics on message producers and consumers, and data on internal replication in Apache Kafka.
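To give a feel for what Prometheus actually retrieves from the jmx_exporter endpoint, here is a minimal sketch of a parser for the Prometheus text exposition format. The metric name in the sample line is illustrative only; the real names exposed by jmx_exporter depend on the rules file (kafka_broker.yml in our setup):

```python
import re

def parse_prometheus_line(line):
    """Parse one line of the Prometheus text exposition format into
    (metric_name, labels, value). Minimal sketch: ignores comments,
    timestamps and label-value escaping."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank line or HELP/TYPE comment
    m = re.match(r'([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)', line)
    if not m:
        return None
    name, raw_labels, value = m.groups()
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels)) if raw_labels else {}
    return name, labels, float(value)

# Hypothetical sample line, similar in shape to real broker metrics:
sample = 'kafka_messages_in_total{topic="orders"} 1234.0'
print(parse_prometheus_line(sample))
# -> ('kafka_messages_in_total', {'topic': 'orders'}, 1234.0)
```

In production you never parse this by hand, of course; Prometheus scrapes the endpoint itself. The sketch only shows the name/labels/value structure every scraped metric has.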

The configuration of jmx_exporter on the Kafka brokers involves running the Java process with a so-called Java agent (the -javaagent switch):

    KAFKA_OPTS: '-javaagent:/bitnami/kafka/libs/jmx_prometheus_javaagent-0.20.0.jar=8888:/bitnami/kafka/libs/kafka_broker.yml'

The Prometheus configuration that reads the values exposed by jmx_exporter looks like this:

    - job_name: "kafka-broker"
      static_configs:
        - targets:
            - kafka-1:8888
            - kafka-2:8888
            - kafka-3:8888
          labels:
            env: "dev"
      relabel_configs:
        - source_labels: [__address__]
          target_label: hostname
          regex: '([^:]+)(:[0-9]+)?'
          replacement: '${1}'

4.2. Monitoring message consumption delays by consumers

One of the most important Kafka metrics is consumer lag, which indicates delays in message processing.

High lag indicates that consumers are not keeping up with the messages produced to Kafka topics. This may point to performance issues in the consuming applications, which in turn require corrective action. The most common remedies are scaling out microservice instances (increasing their number) or optimising the components responsible for data processing.
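The lag metric itself is simple arithmetic: for each partition, the newest (end) offset minus the offset the consumer group has last committed. A minimal sketch with made-up offset numbers (in practice both sides come from the Kafka admin and consumer APIs, which is exactly what kafka-lag-exporter does for us):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition consumer lag: how far the group's committed offset
    trails the latest offset produced to each partition."""
    return {
        partition: end_offsets[partition] - committed_offsets.get(partition, 0)
        for partition in end_offsets
    }

# Latest offsets on the brokers vs. offsets the group has committed:
end = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1200, 2: 1505}

lag = consumer_lag(end, committed)
print(lag)                # {0: 0, 1: 280, 2: 5}
print(max(lag.values()))  # 280 -> partition 1 is the one to investigate
```

A lag of zero means the consumer is fully caught up; a lag that grows monotonically over time, rather than oscillating, is the typical signal that the consuming service needs scaling or optimisation.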

To collect statistics regarding consumption delays, we use the kafka-lag-exporter application: https://github.com/seglo/kafka-lag-exporter

It is defined in the compose.yaml file:

    kafka-lag-exporter:
        image: seglo/kafka-lag-exporter:0.8.2
        volumes:
            - ${PWD}/kafka-lag-exporter:/opt/docker/conf

The configuration of this application is in the kafka-lag-exporter/application.conf file:

    kafka-lag-exporter {
        port = 9999
        client-group-id = "kafkaLagExporter"
        lookup-table-size = 120
        poll-interval = 60

        clusters = [
            {
                name = "dev-cluster"
                bootstrap-brokers = "kafka-1:29092,kafka-2:29092,kafka-3:29092"

                admin-client-properties = {
                    client.id = "admin-client-id"
                    security.protocol = "PLAINTEXT"
                }

                consumer-properties = {
                    client.id = "consumer-client-id"
                    security.protocol = "PLAINTEXT"
                }
            }
        ]
    }

The Grafana panel for kafka-lag-exporter is in the Kafka Lag Exporter.yaml file. It is automatically loaded at the start of the sample environment and is available at: http://localhost:3000/dashboards 
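Beyond dashboards, the same measurements can drive alerting. The following Prometheus alerting rule is a hypothetical example, not part of the demo repository: the kafka_consumergroup_group_lag metric name is the one we believe kafka-lag-exporter exposes, and the 10 000-message threshold is an arbitrary value to tune per consumer group:

```yaml
groups:
  - name: kafka-lag
    rules:
      - alert: ConsumerGroupLagHigh
        # Metric name assumed from kafka-lag-exporter; the threshold is
        # an example value - tune it per consumer group and workload.
        expr: kafka_consumergroup_group_lag > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.group }} lag above 10k messages"
```

The `for: 5m` clause is the important design choice here: it suppresses alerts on short bursts and fires only when the lag stays high, which matches the "monotonically growing lag" failure mode described above.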

4.3. Kafka UI

The last demonstrated tool is Kafka UI: https://github.com/provectus/kafka-ui. It allows for checking the cluster's state and configuration, as well as browsing and searching messages stored in Kafka brokers. This is essential during development and incident analysis in the production environment.

In our demonstration environment, Kafka UI is available at: http://localhost:8080 (login and password admin:admin) 

And that's it - you now know how to monitor Kafka closely, like Big Brother! 😉 

  • [1] Franz Kafka: The Trial (German title: Der Prozess, 1925), Penguin Random House. Several English translations exist, including those by Willa and Edwin Muir, Breon Mitchell, David Wyllie, Mike Mitchell, Idris Parry, and Susanne Lück with Maureen Fitzgibbons.
  • [2] The name Apache Kafka really does come from Franz Kafka: Jay Kreps liked the writer's work and reasoned that, since the system is optimised for writing, it should be named after someone who writes in natural language.

Interesting? Feel free to share!