Event-Driven Architecture
1. Patterns
1.1 Idempotent Event Consumer
Consumers must be able to process events idempotently. In general, an at-least-once delivery guarantee is assumed, which means that a consumer may receive the same event multiple times.
There are several reasons for this:
- Typically, consumers receive and acknowledge events in batches. If processing is interrupted, the same batch is delivered again, and a non-idempotent consumer will process the already handled events a second time.
- Some brokers (including Kafka) support transactions, allowing receiving, acknowledging, and sending new events to occur atomically. However, it is not that simple:
  - First, these transactions are very costly and lead to significantly higher latencies; in larger systems, those costs add up. They arise because processing no longer happens in batches: each event is transmitted individually, along with additional calls similar to a two-phase commit.
  - Second, this approach only works as long as a single technology is used. Otherwise, a transaction manager is required (this is the difference between local and global transactions); without one, events may be lost or published multiple times, bringing us back to the original problem (see Atomic Event Publication).
Idempotent Business Logic
In most (or even all?) cases, business logic can be made idempotent. It’s important to distinguish between absolute and relative state changes.
- Absolute state changes:
  With an absolute state change, a value is typically replaced by a new one, such as an address. These operations are always idempotent.
- Relative state changes:
  A relative state change occurs when, for example, an amount is deducted from an account balance. Reprocessing the event would alter the balance again. Alternatively, an account balance can be modeled as a sequence of transactions. If the event ID is unique and part of the SQL schema, reprocessing results in an error; this is essentially event sourcing. A sketch of this approach follows after the list.
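To make the transaction-sequence variant concrete, here is a minimal sketch in plain JDBC. The table, column, and class names are made up for the example; the unique event ID doubles as the primary key, so a replayed event simply fails on insert.

```java
// Illustrative schema:
// CREATE TABLE account_transaction (
//     event_id   VARCHAR(64)   PRIMARY KEY,   -- the event ID guards against replays
//     account_id VARCHAR(64)   NOT NULL,
//     amount     DECIMAL(19,2) NOT NULL
// );
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class AccountTransactions {

    public void apply(Connection db, String eventId, String accountId, BigDecimal amount) throws SQLException {
        try (PreparedStatement insert = db.prepareStatement(
                "INSERT INTO account_transaction (event_id, account_id, amount) VALUES (?, ?, ?)")) {
            insert.setString(1, eventId);
            insert.setString(2, accountId);
            insert.setBigDecimal(3, amount);
            insert.executeUpdate();
        } catch (SQLException e) {
            // SQLState class 23 = integrity constraint violation: the event was already processed.
            if (e.getSQLState() != null && e.getSQLState().startsWith("23")) {
                return; // duplicate delivery, nothing to do
            }
            throw e;
        }
    }

    public BigDecimal balance(Connection db, String accountId) throws SQLException {
        // The balance is derived from the sequence of transactions instead of being mutated in place.
        try (PreparedStatement query = db.prepareStatement(
                "SELECT COALESCE(SUM(amount), 0) FROM account_transaction WHERE account_id = ?")) {
            query.setString(1, accountId);
            try (ResultSet rs = query.executeQuery()) {
                rs.next();
                return rs.getBigDecimal(1);
            }
        }
    }
}
```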
Remembering Processed Events
Alternatively, the IDs of processed events can be stored in the database. Before processing, a lookup can be performed to check whether the event has already been handled.
For consistency, it’s crucial that storing the event ID happens within the same transaction as all other domain changes.
A prerequisite for a simple solution without locking is that the same event cannot be processed in parallel by different consumers. With brokers like Kafka, this is guaranteed: within a consumer group, each partition is consumed by only one consumer.
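A rough sketch of this lookup-and-remember approach with Spring's JdbcTemplate. The processed_event table, the class, and the method names are assumptions for the example; @Transactional ensures the event ID is stored in the same local transaction as the domain changes.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class IdempotentConsumer {

    private final JdbcTemplate jdbc;

    public IdempotentConsumer(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Transactional
    public void handle(String eventId, String payload) {
        // Lookup: has this event already been processed?
        Integer seen = jdbc.queryForObject(
                "SELECT COUNT(*) FROM processed_event WHERE event_id = ?", Integer.class, eventId);
        if (seen != null && seen > 0) {
            return; // duplicate delivery, skip
        }

        // ... apply the actual domain changes here ...

        // Remember the event ID in the same local transaction as the domain changes,
        // so either both are committed or neither is.
        jdbc.update("INSERT INTO processed_event (event_id) VALUES (?)", eventId);
    }
}
```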
Transactional Event Processing
Another alternative is the use of local (or global) transactions, accepting the associated drawbacks — assuming the broker supports transactions at all.
It’s important to note that when using multiple technologies, a transaction manager is required. Local transactions only work within a single technology, in this case the broker. That is why exactly-once processing works in Kafka Streams, where Kafka also takes on the role of the database.
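For illustration, a sketch of transactional consume-process-produce with the plain Kafka client API, the mechanism Kafka Streams builds on. The topic names are made up, and the surrounding configuration (a transactional.id on the producer, disabled auto-commit and read_committed isolation on consumers) is assumed rather than shown.

```java
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TransactionalProcessor {

    public void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        producer.initTransactions();
        consumer.subscribe(List.of("orders"));

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) {
                continue;
            }

            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    // Produce the result of processing the incoming event.
                    producer.send(new ProducerRecord<>("orders-processed", record.key(), record.value()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // The consumed offsets are committed as part of the producer transaction,
                // so acknowledging and publishing become atomic, but only within Kafka.
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                // Production code also has to handle fencing and reset the consumer position.
                producer.abortTransaction();
            }
        }
    }
}
```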
1.2 Atomic Event Publication
In an event-driven architecture, every change in a domain leads to the publication of an event.
For the overall system to remain consistent, it would be ideal if the change and the publication occurred atomically within the same transaction. This would provide the highest possible level of consistency.
However, this is not easily achievable, as the database and the broker are typically separate systems. To enable a global transaction across both, a transaction manager would be needed.
💡 A local transaction refers to a transaction within a single system (or technology), such as a database. When multiple systems are involved, separate local transactions are created independently. What’s missing is a mechanism to coordinate them. That’s where the transaction manager comes in — a separate system that synchronizes local transactions into a global transaction. This is the only way to achieve exactly once semantics across systems. All other approaches relax this guarantee.
The problem is very similar to ACID and CAP considerations in distributed systems, where trade-offs are made between different properties, or guarantees.
What are these guarantees?
- Ordering Guarantees
  A guarantee regarding the order of processing: whether it must match the original processing sequence or not.
- Delivery / Publication Guarantees
  A guarantee regarding the delivery of events: whether events may be delivered multiple times, exactly once, or at most once.
- Read Your Writes
  A guarantee about immediate consistency, i.e. whether a system can read its own changes right after writing them.
It’s important to note that some guarantees exist along a spectrum. The theory is much more detailed, so this section serves merely as an introduction. There are four approaches to solving this problem. Each fulfills different guarantees and comes with its own trade-offs.
Transaction Manager
A transaction manager coordinates local transactions across independent systems — essentially acting as the orchestrator of a two-phase commit (2PC).
Exactly-once, in-order – A transaction manager provides by far the highest level of consistency between two systems. For this reason, it is often used in banking systems.
Disadvantages
- Performance
  A transaction manager introduces significant overhead, as it is a separate service that orchestrates a 2PC between the systems. In Kafka, for example, using local transactions would also mean that events could no longer be processed in batches.
- Cost
  Transaction managers are often commercial products; there are few, if any, free alternatives.
- Support
  The involved systems must implement certain standards (such as XA) to integrate with a transaction manager.
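To make this concrete, a rough sketch using Spring with a JTA transaction manager and XA-capable resources. A JMS broker stands in for the messaging side because it can participate in XA; the wiring of the JtaTransactionManager, the XA DataSource/ConnectionFactory, and all names are assumptions for the example.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jms.core.JmsTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class AddressService {

    private final JdbcTemplate jdbc;
    private final JmsTemplate jms;

    public AddressService(JdbcTemplate jdbc, JmsTemplate jms) {
        this.jdbc = jdbc;
        this.jms = jms;
    }

    @Transactional
    public void changeAddress(String customerId, String newAddress) {
        // With a JTA setup, both operations are enlisted in one global transaction;
        // the transaction manager performs the 2PC across database and broker.
        jdbc.update("UPDATE customer SET address = ? WHERE id = ?", newAddress, customerId);
        jms.convertAndSend("customer.address-changed", customerId + ":" + newAddress);
    }
}
```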
Transactional Outbox Pattern
With a transactional outbox, the problem of independent transactions is deferred by storing events in the database as part of the same local transaction. A separate process reads and publishes these events in order.
At-least-once, in-order – The storage of events happens exactly once, while their publication is at least once. The problem of independent transactions is effectively shifted by this pattern. As long as reading and publishing do not happen concurrently, the event order is preserved.
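A rough sketch of the write side with Spring; the outbox table and all names are illustrative. The relay that later publishes the stored rows is sketched after the list of outbox types below.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.UUID;

@Service
public class CustomerService {

    private final JdbcTemplate jdbc;

    public CustomerService(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    @Transactional
    public void changeAddress(String customerId, String newAddress) {
        // 1) Domain change.
        jdbc.update("UPDATE customer SET address = ? WHERE id = ?", newAddress, customerId);

        // 2) Event stored in the outbox within the same local transaction
        //    (payload kept deliberately simple for the sketch).
        jdbc.update(
            "INSERT INTO outbox (id, topic, payload, created_at) VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
            UUID.randomUUID().toString(),
            "customer.address-changed",
            customerId + ":" + newAddress);
    }
}
```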
Types of Transactional Outboxes
- Polling Outbox
  The database is regularly polled for new events. This approach has higher latency and significantly increased resource consumption. A sketch of such a relay follows after this list.
- Tailing or Subscriber Outbox
  This type subscribes to a change stream of the database. Databases like Redis and MongoDB offer built-in subscriber mechanisms. For others, the write-ahead log can be used to achieve the same effect.
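A rough sketch of a polling relay, with illustrative names; it assumes scheduling is enabled, a single active relay instance (see the considerations below), and a created_at column for ordering.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.List;
import java.util.Map;

@Component
public class PollingOutboxRelay {

    private final JdbcTemplate jdbc;
    private final KafkaTemplate<String, String> kafka;

    public PollingOutboxRelay(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
        this.jdbc = jdbc;
        this.kafka = kafka;
    }

    @Scheduled(fixedDelay = 1000) // poll interval: a trade-off between latency and load
    public void publishPending() {
        List<Map<String, Object>> rows = jdbc.queryForList(
                "SELECT id, topic, payload FROM outbox ORDER BY created_at LIMIT 100");
        for (Map<String, Object> row : rows) {
            try {
                // Wait for the broker acknowledgement before removing the row; if the
                // delete fails afterwards, the event is published again (at least once).
                kafka.send((String) row.get("topic"), (String) row.get("payload")).get();
                jdbc.update("DELETE FROM outbox WHERE id = ?", row.get("id"));
            } catch (Exception e) {
                return; // stop here and retry in order on the next run
            }
        }
    }
}
```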
Further Considerations and Challenges
- Scaling
  As throughput increases, scaling and latency can become issues. Depending on the context, sharding may help, but it is important that the outbox sharding aligns with the broker's sharding; otherwise, event ordering may break.
- Single Point of Failure
  A single outbox instance represents a single point of failure. Therefore, a cluster with leader election is needed to ensure high availability and low latency.
Event Persistence
Similar to the outbox pattern, events are stored in the database along with an additional status field. Unlike the outbox, however, the events are published directly by the same process, followed by a status update.
An additional process periodically scans the database for unpublished events and publishes them as a fallback.
At-least-once, out-of-order – Due to the nature of the publishing mechanism, maintaining the correct order is best effort.
💡 By the way, this is the mechanism Spring Modulith uses for its internal eventing. Something to keep in mind.
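As an illustration of the mechanism (not of Spring Modulith's internal implementation), a rough sketch with an event table and a status column; all names are assumptions for the example.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;
import java.util.Map;
import java.util.UUID;

@Component
public class PersistedEventPublisher {

    private final JdbcTemplate jdbc;
    private final KafkaTemplate<String, String> kafka;

    public PersistedEventPublisher(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
        this.jdbc = jdbc;
        this.kafka = kafka;
    }

    // Called from the business logic: joins the caller's transaction, so the event row
    // is committed together with the domain change.
    @Transactional
    public String store(String topic, String payload) {
        String id = UUID.randomUUID().toString();
        jdbc.update("INSERT INTO event (id, topic, payload, status) VALUES (?, ?, ?, 'NEW')",
                id, topic, payload);
        return id;
    }

    // Called by the same process right after the transaction has committed.
    public void publish(String id, String topic, String payload) {
        kafka.send(topic, payload);
        jdbc.update("UPDATE event SET status = 'PUBLISHED' WHERE id = ?", id);
    }

    // Fallback: anything still marked NEW was stored but never confirmably published.
    // Re-publishing here is what makes the guarantee at-least-once and out of order.
    @Scheduled(fixedDelay = 30000)
    public void republishPending() {
        List<Map<String, Object>> pending = jdbc.queryForList(
                "SELECT id, topic, payload FROM event WHERE status = 'NEW'");
        for (Map<String, Object> row : pending) {
            publish((String) row.get("id"), (String) row.get("topic"), (String) row.get("payload"));
        }
    }
}
```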
Separate Transaction
Another option is to more or less ignore the problem.
At-most-once, in-order – Both local transactions would, if possible, be nested, so that a failure in the inner transaction leads to a rollback of the outer one. However, if the outer transaction fails after the inner one has been executed, this inevitably leads to inconsistency.
💡 This is exactly what happens when using @Transactional in Spring-Kafka and Spring-Data without an external transaction manager.
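A rough sketch of what this looks like with Spring, with illustrative names: the Kafka send happens inside the database transaction but is not covered by it.

```java
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class SeparateTransactionService {

    private final JdbcTemplate jdbc;
    private final KafkaTemplate<String, String> kafka;

    public SeparateTransactionService(JdbcTemplate jdbc, KafkaTemplate<String, String> kafka) {
        this.jdbc = jdbc;
        this.kafka = kafka;
    }

    @Transactional
    public void changeAddress(String customerId, String newAddress) {
        // Outer (database) transaction.
        jdbc.update("UPDATE customer SET address = ? WHERE id = ?", newAddress, customerId);

        // "Inner" publication: if this send fails with an exception here, the database
        // transaction above rolls back as well. But if the database commit fails after
        // the event has gone out, or the asynchronous send fails after the commit,
        // the two systems diverge (at most once).
        kafka.send("customer.address-changed", customerId + ":" + newAddress);
    }
}
```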