
Counterexamples regarding consistency in event sourced solutions (Part 4)

Julian May  |  05 July 2021

In Part 3 we looked at a counterexample rooted in the lack of a clear 'source' for transitions of data. Having competing sources of truth is something I recommend avoiding whenever possible.

In this last part of the series, we'll look a bit further into that problem, how we may have arrived there, and how to mitigate the forces of legacy that push towards a stateful solution so we may define a proper 'source of truth'.

Multiple sources of truth

Problems

It’s another 'two-phase commit' implementation. The difference from the former example is that we are not just (partially) committing to publishing a message; we are (partially) committing to the new truth about the state of our system. If this problem is not obvious to the development team and they feel ready to implement a solution with event sourcing, I suggest this is a red flag and advise caution. This one is probably the most expensive and unfortunate counterexample I’ve come across in a production system.

//"We still depend on the usage-patterns of the legacy SQL database, but we want the advantages of event sourcing"

public void ChangeTaskStatus(Guid taskId, Status desiredStatus)
{
    var entity = _sqlDbContext.Tasks.First(t => t.Id == taskId);

    //Check business logic
    if (entity.MayTransitionTo(desiredStatus))
    {
        entity.Status = desiredStatus;

        //Save changes in EventStoreDB
        var taskAggregate = _eventStoreRepository.LoadTask(taskId);
        taskAggregate.ChangeStatus(desiredStatus);
        _eventStoreRepository.CommitChanges(taskAggregate);

        //Save changes in SQL
        _sqlDbContext.SaveChanges();
    }
}

This is problematic because:

  • In the example above, it leaves us with an inconsistent state whenever _sqlDbContext.SaveChanges() fails after the events have already been committed to the event store.
  • There are probably many places in the legacy codebase that change data in the SQL database: it can be hard to know whether we have introduced the 'write to the event store' in all relevant code in a typical legacy codebase.
  • Our domain logic is either duplicated or simply left out of the aggregate, which was supposed to be our transactional boundary from where we enforce invariants and assert consistency. Note that in the example above, the business logic is an extension method on the entity from the chosen ORM framework.
  • When investigating bugs related to split-brained parts of our system, it can be hard to know which source is wrong and which is right, or if both sources are wrong
  • It will leave our data inconsistent, and those inconsistencies propagate as the data is used
  • Maintaining the same 'truth' in our different sources over time becomes a maintenance nightmare

Symptoms

Split-brain syndrome!

There is no longer a clear answer to the question of 'what to trust as a source of truth' for a given set of data. At this point, we become a victim of Segal’s law, never knowing what is wrong or right.

Operating such a system, we probably have support-tools/data-patches compensating for inconsistencies going both ways. For example, we might have a 'fix task assignments in SQL according to ES' tool, and a 'fix task assignments in ES according to SQL' tool.

Patching data in a support context in order to fix consistency is an indication that something is wrong: doing it regularly (several times a day) is an indication something is very wrong.

Consequences

Developers are allocated to patching data in production rather than developing new features. Everything downstream from a write-model with these kinds of problems inherits the problem and becomes suspect as well, so most consequences of the previously mentioned counterexamples also apply to this one. To sum it up, it’s not good!

Treatment and recovery

I’ve only seen this manifest when jerry-rigging event sourcing into a state-based (monolithic) system. We need to introduce some clear (vertical) boundaries: which parts of the system should be event sourced, and which should not? Exploring this problem often requires looking for new axes or seams to partition the system by. The seams probably do not fit existing tables in the monolithic database model; event based seams typically follow the processes of a business, rather than the information involved in the processes of the business. Make sure the implementation follows through on this decision: for the event sourced parts, do not update state as an optimistic 2nd commit. Instead, project the SQL-state asynchronously based on events committed to the event store.
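To make that last point concrete, here is a minimal sketch of such an asynchronous projection. It assumes some subscription mechanism feeds every committed event into Handle(); TaskStatusChanged, SqlDbContext and the other names are hypothetical placeholders for illustration, not a specific client library or the codebase from the examples above.

public class TaskStatusProjection
{
    private readonly SqlDbContext _sqlDbContext;

    public TaskStatusProjection(SqlDbContext sqlDbContext)
    {
        _sqlDbContext = sqlDbContext;
    }

    //Called (asynchronously) for every event committed to the event store
    public void Handle(object committedEvent)
    {
        if (committedEvent is TaskStatusChanged changed)
        {
            //The SQL row is now just a read model, updated after the fact,
            //not a second 'source of truth' written in the same request.
            var row = _sqlDbContext.Tasks.First(t => t.Id == changed.TaskId);
            row.Status = changed.NewStatus;
            _sqlDbContext.SaveChanges();
        }
    }
}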

The primary impediment to what I have just proposed is the need to satisfy an integration that expects a synchronous request/reply model, with the reply containing the state after the requested transition. The cleanest options would be either to change the integration towards asynchronous communication and change the client(s) to handle this, or to go back to the old solution that did not involve event sourcing. If these options are not feasible, we need to create other ways to satisfy the integration. There are several options for temporary workarounds, all of which come with important trade-offs. Which one to use depends on our context: I’ve used the following workarounds myself and have yet to regret it, given the constraints of the situations.

Poll and wait

Assuming we return the version of a stream after events have been persisted, we can poll the version of the projected state and retry until it catches up with the committed version, then return the projected state. The downside is the resources (threads, request queue capacity, etc.) spent while waiting and observing the projected revision. The upside to this workaround is that it isolates changes to the implementation of the API; no changes are needed on the consumer side (disregarding potential adjustments of timeout tolerances).

//At seam to legacy integration, that expects synchronous request/response

  var cmd = request.Mapped();
  var commitRev = _commandHandling.Handle(cmd);

  int retries = 0;
  while (commitRev > _readService.RevisionOf(cmd.AggregateId))
  {
      await Task.Delay(100);

      if (retries > 5)
          throw new TimeoutException();
            //^ Tell user to refresh the page?

      if (retries++ > 1)
          _metrics.Register("Client waiting for projected state");
  }

  return _readService.TaskDetails(cmd.AggregateId);

Update the UI according to command result

In a CQRS architecture, we strive not to couple the handling of a command to the information in the views from which users may invoke such a command on the system.

Typically, that view already has the information it would want to show once the command is handled. The information in the command typically originates from the user's selection and input in that view.


The 'workaround' is to have the execution of a command respond with whether the command was handled, queued, rejected, or failed.

I use HTTP Status codes 200, 201, 422 and 500 respectively.

  • In case the command was handled, the UI (which made the command/request and therefore already has the information in it) can update itself.
  • When queued, the response contains a resource to poll or follow.
  • When rejected, the UI can inform the user about the rejection, and explain which rule/condition prevented the transition.
  • If it failed, the UI could excuse the problem and suggest trying again or contacting support.
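
As a sketch of what this can look like at the HTTP boundary, here is a minimal ASP.NET Core-style action mapping the four outcomes to those status codes. ChangeStatusRequest, ChangeTaskStatusCommand, CommandOutcome and the _commandHandling service are hypothetical names for illustration, not the article's actual codebase.

[HttpPost("tasks/{taskId}/status")]
public IActionResult ChangeTaskStatus(Guid taskId, [FromBody] ChangeStatusRequest request)
{
    var result = _commandHandling.Handle(new ChangeTaskStatusCommand(taskId, request.DesiredStatus));

    return result.Outcome switch
    {
        //200: handled - the UI already has the information it sent and can update itself
        CommandOutcome.Handled  => Ok(),
        //201: queued - point the client at a resource to poll or follow
        CommandOutcome.Queued   => Created($"/tasks/{taskId}/status", null),
        //422: rejected - explain which rule/condition prevented the transition
        CommandOutcome.Rejected => UnprocessableEntity(result.RejectionReason),
        //500: failed - suggest trying again or contacting support
        _                       => StatusCode(500)
    };
}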


There are at least two downsides to this workaround. Firstly, the command might not contain all the information needed for the view. As an example, the UI cannot know what the serial number of my new task will be; that information must be generated centrally and is not part of the 'Create Task' request.

Secondly, assumptions are a hidden but rather strong coupling: it's like we are betting on a specific 'cause and effect' relation always being true, because the UI only knows about the trigger for the cause. As features change, we will either lose that bet, or restrict our options and momentum by keeping that coupling (or both!). This downside is manageable to a certain degree when the code is owned by the same team, and people can remember where a certain outcome is assumed on a command-request returning '200 OK'. As features evolve and outcomes change, this implicit coupling can become hard to work with, especially between teams.

UI reacting to messages from the backend

An alternative version of the "Command result"-workaround is to expose the output (the new events) as part of the “Handled”-result, and apply those events to state in the UI. This reduces the reliance on assumptions, but there are certainly valid reasons why we would not want to leak domain events to the client, for example:

  • The events could contain sensitive information we don’t want to expose.
  • Domain events should be considered part of a Bounded Context’s internal model, not part of the boundary’s exposed API. This is because we want to be able to iterate on our model without re-working external downstream consumers. If we consider the UI as part of the respective bounded context, this argument may be less valid (but not invalid).
  • There might be descriptive information required by the UI that is not part of the events – for example, the events could include the IDs of contractors while the UI needs to show the names and companies of those contractors – a new problem for which we would need to find a solution.

So instead of returning the domain events to the UI, I would suggest a clearer separation of UI and Domain Model. One way to do this is to isolate the domain events in some "backend for the frontend": apply the new events of a transition to an otherwise projected state for the frontend before returning the final product to the UI. Think of the result as a transient state for a “UI-specific message”.
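
A minimal sketch of that idea follows, assuming hypothetical names (TaskDetailsView, TaskStatusChanged, and the _readService from the earlier example); the point is the shape – projected state plus new events becomes a UI-specific message – not the exact types.

//Backend-for-frontend: take the already projected state, apply the events that
//were just committed, and return the result as a UI-specific message.
public TaskDetailsView ToUiMessage(Guid taskId, IEnumerable<object> newEvents)
{
    //Start from the (possibly slightly stale) projected state for the frontend...
    var view = _readService.TaskDetails(taskId);

    //...and bring it up to date with the output of this transition, without
    //exposing the domain events themselves to the client.
    foreach (var @event in newEvents)
    {
        if (@event is TaskStatusChanged changed)
            view.Status = changed.NewStatus.ToString();
    }

    return view;
}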

From this point on, once we can supply the UI-specific response messages, we can refactor the UI towards applying those messages asynchronously, rather than reacting to a synchronous response. A gradual refactoring of the UI towards asynchronous event handling could look like this:

1: Originally, the handling of response was inlined in the function sending the request:

function reportTaskTransition(data){
  $.ajax({
    dataType: 'json',
    url: url,
    data: data,
    success: function (response) {
      //DOM or Model manipulation
    }
  });
}

2: We then introduced a UI event, at first just as a name, and moved the handling of it to a separate class. The request function is still calling the handler directly at this point.

function reportTaskStatusTransition(data){
  $.ajax({
    dataType: 'json',
    url: url,
    data: data,
    success: function (taskStatusTransitioned) {
        view.on_taskStatusTransitioned(taskStatusTransitioned);
    }
  });
}

class TaskView {
    on_taskStatusTransitioned(msg){
        //DOM or Model manipulation
    }
}

3: Then we put in a message seam between the view and the request/response. As long as we can dispatch "TaskStatusTransitioned" somehow, it no longer needs to be initiated from a request/response loop.

function reportTaskStatusTransition(data){
  $.ajax({
    dataType: 'json',
    url: url,
    data: data,
    success: function (taskStatusTransitioned) {
        reactor.dispatchEvent(new CustomEvent("TaskStatusTransitioned", {
            detail: taskStatusTransitioned}));
    }
  });
}

class TaskView {
    constructor(reactor){
        reactor.addEventListener("TaskStatusTransitioned",
            (e) => this.on_taskStatusTransitioned(e));
    }

    on_taskStatusTransitioned(msg){
        //DOM or Model manipulation
    }
}

Once we can apply messages to state in the UI, we don't actually need those messages to originate from the response to a request. At this point, we are more or less a middleware implementation away from asynchronous communication, for example using WebSockets. Once the client consumes outputs asynchronously, that decoupling frees us to produce those outputs in any way we see fit in the backend.

Epilogue

The change from a monolithic 'state-based' model to an event-based 'temporal' model is not just an implementation detail: it’s more like a paradigm shift. It calls for different solutions, and for challenging the problems as they are expressed from a 'state-based' point of view. Trying to emulate a 'state-based' solution with event sourcing will probably not be worth the effort. Conversely, rethinking the problem and finding simpler, more elegant solutions to a revised problem statement can be a game-changing exercise. It’s not that it’s categorically harder to solve problems with event sourcing; many solutions are much easier to implement with event sourcing than with a monolithic 'current state'. Typically, the problem of scaling a system is one that is hard to solve in the 'state-based' paradigm, and I think this is where event sourcing (and CQRS) really shines. At that juncture, where scaling is our problem, we will probably already have an existing (more-or-less 'legacy') system in production, and this is where it gets hairy. Transitioning a system from a 'state-based' implementation to an event sourced implementation, while learning about event sourcing for the first time, is an arduous journey for any team. My advice for any team looking to take this journey is:

Go for it, but start with a narrow vertical scope with clear boundaries. Learn from production and evaluate the design before transitioning larger parts of our system.

I hope this article will help prevent some unfortunate design decisions and maybe inspire victims of such decisions with ideas about how to recover.

If you have more counterexamples, hints about where I am mistaken, or have feedback of any kind, I invite you to reach out to me on Twitter.



Julian May is a system developer working with Line of Business software, currently for the construction industry. Always curious about "why is it like this?", he focuses on building bridges between the current consequences of yesterday's decisions and the ambitions for tomorrow. His personal objectives are solving root problems, lifting from the bottom, and facilitating success by removing obstacles.