Counterexamples regarding consistency in event sourced solutions (Part 1)

Some mistakes can easily be prevented by knowing how to avoid them. Some things may seem implicit and obvious to some people, while seeming esoteric and opaque to others. Software is hard and involves hard problems, and there is so much to learn; I write this with compassion and respect for developers currently in the situations I describe in these posts.

I remember one summer's day as a child, getting shocked by grasping an electric fence without knowing it would hurt. If I had known it would hurt, I would not have touched it, and so would have avoided the shock and pain. Some people may touch the electric fence on purpose (how bad could it be, right?), but I digress.

This article in four parts is meant to be a friendly warning sign: “Careful: electric fence!”. It's about the things that will hurt you, unless you avoid them, regarding the consistency of data and behavior in event sourced solutions.

When I first heard about event sourcing, it honestly blew my mind; the sheer number of new possibilities and opportunities it opens up still fascinates me to this day. I started working with event sourcing in 2014, and since then I’ve worked with it in legacy systems at different jobs. My primary experience comes from solutions already in production, and the problems posed by those solutions. Based on those experiences, I would like to share some examples of what not to do, at least not without thoroughly considering the consequences. I’ve personally worked through the counterexamples given herein, except for the 'two-phase commit for publishing events to a queue', which is borrowed from the war stories of others.

I wish to extend my sympathy to anyone currently in one or more of the situations described here. My ambition with this article is to provide helpful counterexamples and suggest possible recovery options. I do not consider myself an expert or authority on the subject, and the advice I give is based on what worked for us in these situations.

I’ll stick to a SQL database as the persistence for stateful read-models in these examples, with pseudocode abstracting the actual T-SQL statements. The examples could also (not) work with state persisted in other databases, serialized objects in a file, or objects in application memory.

When I started working with line-of-business software, and in school before that, we were taught to implement synchronous, 'current state' based solutions, using ACID transactions. This approach to designing solutions is widespread, understandably so, given its success with the kind of software we used to build. This way of thinking extends even beyond developers, as Product Owners, Domain Experts, and even end users build their understanding of software from the software they have been exposed to. To my mind, the commercial software industry has taken on a 'state-bias'. This bias gives us a sort of blind spot that might prevent us from seeing the problems described here before we stumble upon them in production.

When sparring about how to unravel such a problem, we came upon 'timelines' as a useful metaphor: a SQL database has a single 'timeline', in the sense that one change may enforce implicit locks on the entire database. While the SQL database has a write-ahead log, which resembles the $all stream of EventStoreDB, the two offer vastly different constraints you can employ. The main difference is the breadth of the scope we can lock at one time. Another word for this is 'transaction scope': idiomatic event sourced solutions prohibit wide transaction scopes, while SQL makes such wide transaction scopes easy. A 'timeline' is not the same as a stream: two subscribers of the same stream might be at different positions in that stream, and when that is possible, we think of those subscribers as being 'on separate timelines'. 'Subscription' is an overloaded term; a 'timeline' is not the same as a 'subscription', since subscriptions might be layered, composed, chained, or otherwise aggregated (possibly in the same timeline).

If this metaphor of 'timelines' does not make much sense to you yet, please bear with me, as I will be using it as shorthand for a 'temporal boundary wherein we can guard consistency invariants'.
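
To make the metaphor a little more concrete, here is a minimal sketch of two subscribers reading the same stream from their own checkpoints. The `Subscriber` class and the event names `TaskCreated` and `ContractorAssigned` are hypothetical and only serve the illustration; nothing here is a real EventStoreDB API.

```python
from dataclasses import dataclass, field


@dataclass
class Subscriber:
    """A hypothetical subscriber that tracks its own position in a stream."""
    name: str
    position: int = 0                      # number of events handled so far
    seen: list = field(default_factory=list)

    def catch_up(self, stream: list, max_events: int) -> None:
        # Handle at most `max_events` events, starting from our own checkpoint.
        batch = stream[self.position:self.position + max_events]
        self.seen.extend(batch)
        self.position += len(batch)


# One ordered stream (event names partly assumed for the example).
stream = ["TaskCreated", "ContractorAssigned", "ContractorTerminated", "DeadlineExceeded"]

contractors = Subscriber("contractors")
task_ui = Subscriber("task-ui")
contractors.catch_up(stream, max_events=4)   # fully caught up
task_ui.catch_up(stream, max_events=2)       # still lagging behind

# Same stream, two positions: each subscriber lives on its own timeline.
print(contractors.position, task_ui.position)
```

At this moment the two subscribers disagree about what has happened, which is exactly why state produced by one of them cannot safely be treated as "current" by the other.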

[Figure: Counterexamples-1 (SQL database, one timeline)]

[Figure: Counterexamples-2 (EventStoreDB, many timelines)]

How can it ever work?

"A man with a watch knows what time it is. A man with two watches is never sure." Segal's law

It turns out we can solve problems with the integrity of data and consistency of behavior with Segal’s two watches, once we start reacting to the individual time of each watch, and add our own timeline to the equation:

Whenever I observe the yellow watch reached 10 pm, I go to sleep
Whenever I observe the blue watch reached “time to go”, I get going
… unless I observed that 10 times since I last slept.

With the SQL database, we can implement our assertions for consistency as unique indexes, composite primary keys, foreign key constraints, or even triggers. This gives us certain guarantees for transitions of data within one database, which we can reason about in its entirety as 'one state'. With event sourcing, we rely on optimistic offline-locking on the length of the particular stream we are writing to, as the only way to ensure that 'the state evaluated has not changed in the meantime'. We can use that guarantee to enforce invariants in code, based on the information in that stream, for the events we write to that stream.
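
As a rough illustration of that guarantee, the sketch below uses a hypothetical in-memory store (not a real client API) where an append only succeeds if the stream still has the length the writer saw when it read it. The `assign_contractor` function and the `ContractorAssigned` event are assumptions made up for the example.

```python
class WrongExpectedVersion(Exception):
    """Raised when the stream changed after the caller read it."""


class InMemoryEventStore:
    """Hypothetical stand-in for an event store, for illustration only."""

    def __init__(self):
        self._streams = {}

    def read(self, stream_id):
        events = list(self._streams.get(stream_id, []))
        return events, len(events)          # events plus current stream length

    def append(self, stream_id, expected_version, new_events):
        current = self._streams.get(stream_id, [])
        if len(current) != expected_version:
            raise WrongExpectedVersion(
                f"{stream_id}: expected {expected_version}, found {len(current)}"
            )
        self._streams[stream_id] = current + list(new_events)


def assign_contractor(store, task_id, contractor_id):
    stream_id = f"task-{task_id}"
    events, version = store.read(stream_id)
    # The invariant is evaluated only from events in the target stream.
    if any(e["type"] == "ContractorAssigned" for e in events):
        raise ValueError("task already has a contractor")
    store.append(stream_id, version,
                 [{"type": "ContractorAssigned", "contractorId": contractor_id}])


store = InMemoryEventStore()
# Two writers read the same (empty) stream before either of them writes.
_, version_seen_by_a = store.read("task-1")
_, version_seen_by_b = store.read("task-1")

store.append("task-1", version_seen_by_a,
             [{"type": "ContractorAssigned", "contractorId": "c-1"}])
try:
    store.append("task-1", version_seen_by_b,
                 [{"type": "ContractorAssigned", "contractorId": "c-2"}])
    conflict_detected = False
except WrongExpectedVersion:
    conflict_detected = True

try:
    assign_contractor(store, "1", "c-2")    # retry: re-reads and sees the assignment
    retry_rejected = False
except ValueError:
    retry_rejected = True
```

The second writer is stopped by the version check rather than by a lock, and on retry the invariant is re-evaluated against the events that are actually in the stream.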

Beyond this, we may use information from outside that stream in evaluating those invariants, when we can accept the risk of external information being stale when we use it. There are many situations where using external information is the right choice, even without a guarantee that it is the latest version of that information, or that it does not change before we commit our events to our target stream. There are also many reasons for not accepting that risk, in which case we need to model the constraint as another stream or build a process for it.
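
One possible way to 'model the constraint as another stream' is to give the constraint its own small stream and rely on the append failing when that stream already exists. The sketch below is only an illustration under assumptions: the uniqueness rule (one contractor per email address), the `emailclaim-` stream prefix, and all function names are hypothetical.

```python
class StreamAlreadyExists(Exception):
    """Raised when appending to a stream that was expected not to exist."""


streams = {}  # hypothetical in-memory stand-in for an event store


def append_to_new_stream(stream_id, event):
    # Append only if the stream does not exist yet (expected version: no stream).
    if stream_id in streams:
        raise StreamAlreadyExists(stream_id)
    streams[stream_id] = [event]


def claim_email(email, contractor_id):
    # The constraint becomes its own tiny stream, keyed by the normalized value.
    stream_id = f"emailclaim-{email.lower()}"
    try:
        append_to_new_stream(
            stream_id, {"type": "EmailClaimed", "contractorId": contractor_id}
        )
        return True
    except StreamAlreadyExists:
        return False


first_claim = claim_email("kim@example.com", "c-1")
second_claim = claim_email("Kim@example.com", "c-2")   # same address, different case
print(first_claim, second_claim)
```

The check no longer depends on a possibly stale read-model; the trade-off is an extra stream and an extra write for each constraint we guard this way.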

These decisions are not just implementation details; they are business decisions. Domain Experts and Product Owners, having used business software previously, are often also 'state-biased', which affects how they resolve their problems and consider solutions. My experience is that a lot of design and reasoning improves when we start communicating about what happens (or not) in a system, rather than which data is where at a given time. Making those 'things that happen (or not)' less transactionally coupled and more explicit and precise might unlock business opportunities, especially if it makes our product stand out from the competition. Besides being different in where it fits in a typical system, event sourcing also requires us to make more nuanced design decisions regarding data consistency and guarantees. Without making those choices deliberately, we might be in for some unfortunate surprises once the system goes to production.

What follows in this four-part series are examples of solutions with incidental consistency-related problems, and suggestions on how to spot, understand, and fix them.

Confusing the timelines in the processing of events

Problems

A state-projection uses state from another projection (on another timeline) to enrich event information within its handling of that event.
A policy/saga/process manager uses state that is the product of a separate subscription (on another timeline).

Symptoms

Consistency/integrity issues in state-projections.
Incorrect behavior of automated processes.

Consequences

The projected state becomes inconsistent whenever the projection loses the race against the other timeline. Rehydration and versioning of the state-projection become nondeterministic. When a policy/saga/process manager reacts to an upstream event using state from a stale external state-projection, we get unpredictable and undesired outcomes. We might email contractors about outstanding deadlines, even though those contractors are no longer responsible for any tasks.

Let's look at an example:

The TaskUI projection builds a read-model that exposes, among others, an Outstanding Deadlines-query. In order to do so, it depends on the Availability of the assigned Contractor. Availability is already exposed via the Contractor Lookup-query, which is projected from another timeline. This means that the TaskUI projection depends on the Contractors projection via the Contractor Lookup. This dependency introduces a race condition, which causes an incorrect state in the TaskUI projection whenever the Contractors projection is still catching up.

[Figure: Counterexamples-3]

// In the 'TaskUI' projection:

Handle(DeadlineExceeded ev)
{
    var task = _tasks[ev.TaskId];

    // Retrieving contractor information from an external source (another timeline).
    // The lookup may be at a different position than this projection.
    var contractor = ContractorLookup.ById(task.AssignedContractorId);

    if (contractor.Availability == "Terminated")
        task.ExceededBy(task.AssignedContractorId);
    else
        task.NeedsImmediateFollowup(task.OverseerId);
}
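
The race can be made visible with a small simulation. This is only a sketch under assumptions: the event shapes, the `tasks` dictionary, and the default availability of "Available" are made up for the illustration, and the two functions stand in for the Contractors and TaskUI projections.

```python
# One ordered log of events; the event names mirror the example above.
log = [
    {"type": "ContractorTerminated", "contractorId": "c-1"},
    {"type": "DeadlineExceeded", "taskId": "t-1"},
]
tasks = {"t-1": {"assignedContractorId": "c-1", "overseerId": "o-1"}}


def contractors_projection(events):
    # Builds the state behind the Contractor Lookup-query.
    availability = {}
    for ev in events:
        if ev["type"] == "ContractorTerminated":
            availability[ev["contractorId"]] = "Terminated"
    return availability


def task_ui_projection(events, contractor_lookup):
    # Mirrors the handler above: the outcome depends on an external lookup.
    outcome = {}
    for ev in events:
        if ev["type"] == "DeadlineExceeded":
            task = tasks[ev["taskId"]]
            availability = contractor_lookup.get(
                task["assignedContractorId"], "Available"  # assumed default
            )
            if availability == "Terminated":
                outcome[ev["taskId"]] = ("ExceededBy", task["assignedContractorId"])
            else:
                outcome[ev["taskId"]] = ("NeedsImmediateFollowup", task["overseerId"])
    return outcome


# Same events for TaskUI, but the Contractors projection is at different positions.
caught_up = task_ui_projection(log, contractors_projection(log))
lagging = task_ui_projection(log, contractors_projection(log[:0]))
print(caught_up)
print(lagging)
```

The TaskUI projection processes exactly the same events in both runs, yet ends up with different state, depending only on how far the other timeline happened to have progressed.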

I’ve seen workarounds for this issue where information is retrieved by loading an aggregate within an event handler of a policy/saga/process manager. This was done to circumvent eventual consistency, with the best of intentions, by having the Aggregate Root expose state via property getters. While this solution does make sure we have the 'latest and greatest' version of the aggregate, it gives us a new set of problems:

If used inside an event handler of a state-projection, rehydration becomes nondeterministic.
If used inside an event handler of a policy/saga/process manager, the dependent state may be ahead of the timeline of the policy, which can be a problem.
It is additional IO which incurs a performance impact.
The aggregate loses precision and takes on unrelated responsibilities and coupling.

Treatment and recovery

In general, I try to avoid 'external' sources for information in an event-loop when possible. Sometimes we may not have that luxury, and in those cases we should consider which guarantees and safeguards we might need. In this example, however, there is no actual need for such complication. Rather, we can make sure all information used to build the projection comes from events in the same ordered timeline. We can do this by tracking and maintaining the state of contractor availability in the TaskUI projection:

[Figure: Counterexamples-4]

Even though the availability is not exposed in any queries backed by this projection, we need it to compute state that IS exposed, so the projection depends on that information. However, we might only care about whether a contractor has been terminated; we do not have to track all information about a contractor.

// In the 'TaskUI' projection:

Handle(DeadlineExceeded ev)
{
    var task = _tasks[ev.TaskId];

    // Availability is tracked by this projection itself, on the same timeline.
    // Assuming a Dictionary: a contractor without an entry has not been terminated.
    _contractorAvailability.TryGetValue(task.AssignedContractorId, out var availability);

    if (availability == "Terminated")
        task.ExceededBy(task.AssignedContractorId);
    else
        task.NeedsImmediateFollowup(task.OverseerId);
}

Handle(ContractorTerminated ev)
{
    _contractorAvailability[ev.ContractorId] = "Terminated";
}

The same approach works for the policies/sagas/process managers: ensure that any state used by them is a product of events consumed in the same timeline as the events that trigger desired behavior.
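
A minimal sketch of such a policy might look like the following. The class name, the `ContractorAssigned` event, and the command tuples are hypothetical; the point is only that every piece of state the policy consults was built from events it consumed itself, in order.

```python
class DeadlineReminderPolicy:
    """Hypothetical policy deciding whom to notify when a deadline is exceeded."""

    def __init__(self):
        self._assignments = {}     # taskId -> contractorId
        self._terminated = set()   # contractorIds known to be terminated
        self.commands = []         # decisions produced by the policy

    def handle(self, ev):
        kind = ev["type"]
        if kind == "ContractorAssigned":
            self._assignments[ev["taskId"]] = ev["contractorId"]
        elif kind == "ContractorTerminated":
            self._terminated.add(ev["contractorId"])
        elif kind == "DeadlineExceeded":
            contractor = self._assignments.get(ev["taskId"])
            # Only state from this policy's own timeline is consulted here.
            if contractor is not None and contractor not in self._terminated:
                self.commands.append(("EmailContractor", contractor, ev["taskId"]))
            else:
                self.commands.append(("NotifyOverseer", ev["taskId"]))


def replay(events):
    policy = DeadlineReminderPolicy()
    for ev in events:
        policy.handle(ev)
    return policy.commands


events = [
    {"type": "ContractorAssigned", "taskId": "t-1", "contractorId": "c-1"},
    {"type": "ContractorTerminated", "contractorId": "c-1"},
    {"type": "DeadlineExceeded", "taskId": "t-1"},
]

first_run = replay(events)
second_run = replay(events)
print(first_run)
```

Because the decision depends only on the ordered events the policy has seen, replaying the same events gives the same decisions, and the terminated contractor is not emailed.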

To be clear, it's not that the $all-stream is important here; any one stream with persisted ordering gives us the guarantee we strive for. Normally I would generate and subscribe to a projected stream in order to only receive and deserialize events from relevant categories of streams, but the current version of EventStoreDB provides a new "server-side filtering" feature, which is a simpler solution for most scenarios like this one.

End of Part 1

In Part 2 we'll look at problems related to failing projections as well as some arguments against publishing events via a queue rather than an eventstore.