Apache Kafka is an open-source distributed platform for processing streaming data. Thousands of businesses use Kafka for their big data needs, including more than 80 percent of Fortune 100 companies, which makes it an excellent choice for handling extremely large numbers of events in real time.
Two concepts that are central to Apache Kafka—and to database systems in general—are consistency and completeness.
So what do consistency and completeness mean in Apache Kafka, and how does Kafka achieve them for its users?
In database systems, consistency refers to the fact that transactions can only bring the database from one valid state to another. Eventual consistency is a model for achieving consistency in distributed computing systems and databases.
If a system has eventual consistency, then any update to a particular node in the system will eventually be reflected in all nodes in the system.
Achieving eventual consistency in distributed systems like Apache Kafka is no small feat, however. Crashes and other glitches can interfere with the ability to efficiently deliver records from one location to another.
In particular, suppose that the system crashes while processing a record but before committing the transaction.
Once the system reboots and recovers, there is a risk that it will either not finish processing or will process the same record twice, leading to inconsistencies between nodes.
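To make the failure mode concrete, here is a minimal Python sketch of a processor that writes its output first and commits its position afterwards. It is a simulation only and does not use Kafka: the names `process_all`, `state` and `crash_at` are hypothetical, and the "crash" is just a raised exception.

```python
class CrashBeforeCommit(Exception):
    """Simulated crash after a record is processed but before the commit."""


def process_all(records, state, crash_at=None):
    """Process records from the last committed offset onwards.

    `state` stands in for durable storage that survives a restart.
    """
    while state["committed"] < len(records):
        offset = state["committed"]
        # Step 1: the side effect (the output) happens first.
        state["output"].append(records[offset].upper())
        # A crash here leaves the output written but the offset uncommitted.
        if offset == crash_at:
            raise CrashBeforeCommit(offset)
        # Step 2: commit the offset so the record is not processed again.
        state["committed"] = offset + 1


records = ["a", "b", "c"]
state = {"committed": 0, "output": []}

try:
    process_all(records, state, crash_at=1)  # crashes while handling "b"
except CrashBeforeCommit:
    pass

process_all(records, state)  # restart: resumes from the last committed offset
print(state["output"])
```

After the restart, the record at offset 1 is processed a second time because its offset was never committed, so the output contains a duplicate. Committing before producing the output would avoid the duplicate but risk losing the record instead.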
In stream processing systems, completeness means that the final results are stored in the order in which the records were created, rather than the order in which they were processed.
In publish-subscribe architectures such as Apache Kafka, a record’s “event time” is when it is created (i.e. published) and its “processing time” is when it is processed by a subscriber.
Distributed systems need to handle issues of completeness because the order in which records are processed may differ from the order in which they were created. Latency, network delays and clocks that are out of sync are some common causes of this discrepancy.
A system that has the property of completeness stores results in order of their event time, rather than their processing time.
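The difference between the two orderings is easy to see in a small example. The sketch below is plain Python, not Kafka code; the `Record` class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    key: str
    event_time: int       # when the record was created (published)
    processing_time: int  # when a subscriber processed it


# Record "b" was created before "c" but was delayed in transit.
arrived = [
    Record("a", event_time=1, processing_time=10),
    Record("c", event_time=3, processing_time=11),
    Record("b", event_time=2, processing_time=12),
]

by_processing = [r.key for r in sorted(arrived, key=lambda r: r.processing_time)]
by_event = [r.key for r in sorted(arrived, key=lambda r: r.event_time)]

print(by_processing)  # order in which the subscriber saw the records
print(by_event)       # order in which the records were created
```

A result built in processing-time order would place "c" before "b", while a complete result follows event time and places "b" first.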
Completeness is a strongly desirable property for database systems. Without completeness, streaming processors may incorrectly emit partial outputs based on out-of-order data as the system’s final results.
The good news is that Apache Kafka is able to achieve both consistency and completeness for distributed processing of streaming data (these two objectives are part of the larger goal of correctness for database systems).
So how does Kafka achieve consistency and completeness for real-time event processing?
First, let’s discuss consistency in Kafka. Apache Kafka provides multiple delivery guarantees that users can leverage to ensure that messages are always delivered and that the system remains consistent.
In “at-least-once” delivery, Kafka guarantees that all messages will be delivered at least one time—and possibly more, depending on how many times the producer tries to send the message.
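As a rough illustration of why retries can produce duplicates, consider the following Python simulation. It does not call Kafka; `send_at_least_once`, `broker_log` and `ack_lost_on_attempts` are hypothetical names, and a lost acknowledgement is modeled by a set of attempt numbers.

```python
def send_at_least_once(message, broker_log, ack_lost_on_attempts, max_retries=5):
    """Resend `message` until an acknowledgement arrives; return attempts used."""
    for attempt in range(1, max_retries + 1):
        # The broker stores the message on every attempt ...
        broker_log.append(message)
        # ... but the acknowledgement may be lost on the way back.
        if attempt not in ack_lost_on_attempts:
            return attempt
    raise RuntimeError("no acknowledgement received")


broker_log = []
attempts = send_at_least_once("order-42", broker_log, ack_lost_on_attempts={1})
print(attempts, broker_log)
```

The first acknowledgement is lost, so the producer sends again. The message is never dropped, but the broker's log now holds it twice.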
A stricter guarantee is “exactly-once” delivery in Kafka, which guarantees that each message will be delivered exactly one time, even if the producer retries. Distributed event processing systems can use Kafka’s “exactly-once” delivery to ensure that the system’s property of eventual consistency is preserved.
Behind the scenes, “exactly-once” delivery in Kafka is implemented with a two-phase commit (2PC) protocol.
Each read/process/write step in Kafka is appended to a log using an idempotent operation (i.e. the append will happen only once even if the operation is executed multiple times).
The system can then check these logs to see if there are any pending transactions that need to be finished.
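The two ideas above, idempotent appends and checking a log for unfinished transactions, can be sketched in a few lines of Python. This is a simplified model rather than Kafka's actual implementation: `IdempotentLog`, `pending_transactions` and the `BEGIN`/`COMMIT` markers are assumptions chosen for illustration, and duplicates are detected with a per-producer sequence number.

```python
class IdempotentLog:
    """Append-only log that ignores retried appends from the same producer."""

    def __init__(self):
        self.entries = []
        self._last_seq = {}  # producer_id -> highest sequence number appended

    def append(self, producer_id, seq, value):
        """Append `value` unless this (producer_id, seq) was already appended."""
        if seq <= self._last_seq.get(producer_id, -1):
            return False  # duplicate retry: the log is left unchanged
        self.entries.append(value)
        self._last_seq[producer_id] = seq
        return True


def pending_transactions(txn_log):
    """Return ids of transactions whose latest marker is still BEGIN."""
    latest = {}
    for txn_id, marker in txn_log:
        latest[txn_id] = marker
    return [txn_id for txn_id, marker in latest.items() if marker == "BEGIN"]


log = IdempotentLog()
results = [
    log.append("p1", 0, "x"),
    log.append("p1", 0, "x"),  # retry of the same append
    log.append("p1", 1, "y"),
]

txn_log = [("t1", "BEGIN"), ("t1", "COMMIT"), ("t2", "BEGIN")]
pending = pending_transactions(txn_log)
print(results, log.entries, pending)
```

The retried append is rejected, so the log holds each value once. After a restart, scanning the transaction markers shows that `t2` was started but never finished and still needs to be completed or rolled back.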
What about completeness in Kafka? In Apache Kafka’s Kafka Streams client library, completeness is achieved with so-called “revision-based speculative processing mechanisms.”
In other words, Kafka Streams assumes that (most) records have arrived in the correct order based on their event time, which helps improve performance because the system does not wait to emit results.
If Kafka Streams does discover that some records are out of order, it updates its output by invalidating the previous results, revising them, and emitting the correct results.
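The emit-then-revise behavior can be illustrated with a windowed count. The Python sketch below is a simplified model and not the Kafka Streams API: `speculative_counts` and `WINDOW_SIZE` are hypothetical, and a "revision" is simply a newer result for a window that has already been emitted.

```python
from collections import defaultdict

WINDOW_SIZE = 10  # hypothetical window length, in event-time units


def speculative_counts(event_times):
    """Count records per event-time window, emitting a result on every update.

    Returns the list of emitted (window_start, count) results and the final
    counts. A later result for the same window revises the earlier one.
    """
    counts = defaultdict(int)
    emitted = []
    for t in event_times:
        window = (t // WINDOW_SIZE) * WINDOW_SIZE
        counts[window] += 1
        # Emit immediately instead of waiting for the window to be "complete".
        emitted.append((window, counts[window]))
    return emitted, dict(counts)


# The record with event time 7 arrives late, after window 10 has started.
emitted, final = speculative_counts([1, 4, 12, 7])
print(emitted)
print(final)
```

The count of 2 for window 0 is emitted without waiting. When the late record arrives, a revised count of 3 is emitted for the same window, and downstream consumers treat the newer value as the correct one.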
Apache Kafka is a tremendously powerful platform for processing real-time data—but only if you know how to use it.
Many companies struggle on their path to Kafka adoption due to various technical challenges and complexities.
Don’t have Kafka experts on your in-house IT staff? No problem. You can join forces with a knowledgeable, time-tested Kafka managed services provider to help install, deploy and maintain Apache Kafka in your IT environment.
Adservio employs a skilled team of IT professionals who help companies achieve digital excellence in use cases from application development and analytics to software delivery and process automation.
If you need assistance with Apache Kafka, we can help. Get in touch with our team of experts and let us know about your business needs and objectives.