Event immutability and dealing with change

Savvas Kleanthous  |  20 January 2021

One of the first things people hear when they start working with Event Sourcing concerns the immutability of events, and how this is a useful attribute for data that drives business logic and decisions in software. Very often, however, little more is presented as an argument for immutability than 'auditing', and while auditing is a valid reason to adopt immutability for your write-path data, there are other reasons to embrace it, some arguably more notable than auditing.

TL;DR: While immutability may seem on the surface to be problematic as a write-path database trait, it is an enabling constraint. It guides us towards an approach to dealing with changes and discoveries that has a lot of positive characteristics.

In the first part of this article, I will try to present why you should consider immutability carefully, even if you don't have auditing requirements. Then, in the second part, we'll go over what you can do when you discover you need to change something.

In this article, I will assume that you're familiar with the basic concepts of event sourcing. For an introduction, see the What is event sourcing? blog post or one of the many talks Greg Young has on YouTube on the subject.

What do we mean by immutability?

People more commonly encounter the term immutability when they start working with functional languages. In that context, immutability is an approach to data where the state of an object cannot mutate (or change) after its creation. The term has a semantically similar meaning in event sourcing: it describes the practice and policy of not updating, or in any way altering, the event after it has been persisted in our database.

If you don't have any experience with functional languages, and your background is with normal form databases or document stores, the above concept may seem entirely alien. It is, however, natural to event sourcing, where we persist entities in the form of domain-specific and domain-recognised events. Whenever we need to change any property of an entity, we simply append an event to the stream that holds the data for that entity.

Two things that I have very regularly heard from people who first try to adopt an immutable store are:

  • Why would you want to have immutable data in the first place? With a mutable store, making changes is simple. What's the benefit?
  • How do you go about making any changes when you're not allowed to update data?

I'll try to answer both questions in this article.

Immutability as a desired trait

While on the surface it may seem that immutability involves more work, in practice the amount of effort involved is usually similar. Immutability represents a trade-off: you surrender a potentially familiar tech approach with mature tooling to gain:

  • No loss of context
  • The system is significantly easier to debug
  • When a correction is needed the process is safer, easier and better known
  • You have a proper audit log
  • Potentially better user experience when errors get corrected.

Below, I will be using an event store as an example of an immutable store for saving write-path data. While there are many types of immutable databases (event stores, streaming databases, and time-series databases, to name a few), an event store is a great choice for storing business-logic data. The benefits of using immutable stores apply in all cases, sometimes to a lesser extent, but they derive directly from the fact that we are not replacing older data; instead, we append the changes we need.

Without further delay, let's visit each in a bit more detail.

1. No loss of context

From a particular perspective, immutability is inherent to event sourcing itself. This is because, to make any change in an event-sourced system, rather than mutating a row in a table, we emit a domain-specific event that clearly and specifically describes the change that has occurred, in as much detail as we need.

The persistence of context is the root cause of some of the other benefits in this list. It is, however, also an essential benefit in and of itself. By preserving context, we gain significant insight into a potentially complex system and can answer questions historically. For example, we can find out how many goods had their prices changed within a certain period. We can find out how many people had a negative balance for more than two consecutive days. We can report on the number of new daily sign-ups or the accounts that became active per day.

The ability to answer these questions is built in when you're storing information using an immutable store, and importantly we have this historically: you will very often be able to answer questions similar to the above examples even for pre-existing data in your system, even if you didn't prepare beforehand. With mutable systems, when any of the above requests came in, we would have to add new features to our store (new columns or tables), do further work to write to those rows, and quite possibly still wouldn't be able to extract the information for data that already exists.
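
As a purely illustrative sketch (the PriceChanged event shape and property names below are assumptions, not anything prescribed by a particular store), answering the "price changes within a period" question can be a simple fold over events that already exist:

```typescript
// Illustrative sketch only: the PriceChanged shape below is an assumption.
interface PriceChanged {
  type: "PriceChanged";
  productId: string;
  newPrice: number;
  occurredAt: string; // ISO-8601 timestamp
}

// "How many goods had their prices changed within a certain period?"
// becomes a simple pass over events that already exist in the store.
function productsWithPriceChanges(events: PriceChanged[], from: Date, to: Date): number {
  const changed = new Set<string>();
  for (const e of events) {
    const at = new Date(e.occurredAt).getTime();
    if (at >= from.getTime() && at <= to.getTime()) changed.add(e.productId);
  }
  return changed.size;
}
```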

I mention this point because, while it is the root cause of other essential benefits, context on its own is of extreme importance to a competitive, lean business that needs to outmanoeuvre the competition. Being able to ask questions of existing data, and get answers immediately, can be an immense competitive advantage.

2. Easier to debug

Assume for a minute that you just got woken up at 3:00 AM because something or other is failing in production. What kind of information would you prefer to have available?

  • Some rows of current values, parts of which are wrong, but with no clue how they came to hold the data they currently do, because old information keeps being overwritten with new.
  • A detailed history of all changes that happened, along with a detailed history of all changes to all dependencies, each change describing what happened in domain-specific terms, full of context.

Obviously, having a complete history of changes for your entities and all dependencies makes debugging and support much more manageable, especially when you're under pressure. Or any other time you need to debug for that matter. To achieve this we need to use metadata and follow a few basic principles, but using event sourcing and immutable events is the enabling factor, and it works really well with observability principles and tools.

Moreover, since data never change, tracing how a piece of data came to be in its current state is often straightforward. When causal analysis of changes is easy, you can focus on fixing the problem instead of hunting for it.
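
To make the metadata point concrete, here is a minimal sketch of the kind of metadata that makes causal analysis easy. The field names (correlationId, causationId, and so on) are common conventions used here as assumptions, not something mandated by any specific store:

```typescript
// Illustrative only: a common metadata shape that makes causal analysis easy.
interface EventMetadata {
  eventId: string;       // unique id of this event
  correlationId: string; // ties together everything caused by one original request
  causationId: string;   // id of the message that directly caused this event
  occurredAt: string;    // when it happened
  userId?: string;       // who triggered it, if applicable
}

interface RecordedEvent {
  type: string;
  data: unknown;
  metadata: EventMetadata;
}

// "Why does this piece of data look like this?" becomes: walk causationId
// backwards through the recorded events until we reach the originating message.
function causalChain(all: RecordedEvent[], start: RecordedEvent): RecordedEvent[] {
  const byId = new Map(all.map(e => [e.metadata.eventId, e] as const));
  const chain: RecordedEvent[] = [start];
  const seen = new Set<string>([start.metadata.eventId]);
  let current = start;
  while (byId.has(current.metadata.causationId)) {
    const parent = byId.get(current.metadata.causationId)!;
    if (seen.has(parent.metadata.eventId)) break; // guard against cycles
    seen.add(parent.metadata.eventId);
    chain.push(parent);
    current = parent;
  }
  return chain;
}
```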

3. Safer, known process for corrections

No matter how strict you are with decoupling, you can't have a system with absolutely zero coupling. Nor should you. Systems of any complexity naturally have upstream pieces of logic that directly or indirectly cause changes to downstream parts. But what happens when the upstream component made a mistake, and that cascades to consumers?

As is often the case, fixing the source of the problem and its local data is by far the more straightforward piece of work.

How should dependent components react to the correction? Should they deal with the change? Should they roll back and re-run the new request? What about if any intermediate actions happened in the meantime? Should this latest information simply overwrite the existing one? These are tough questions to answer when operations aren't commutative. And the longer it takes to discover the problem, the more difficult it is to fix.

You may not be aware of them, but when dealing with non-trivial mistakes and bugs (think issuing a wrong invoice, not an incorrect value entered in the UI), chances are that established processes already exist to fix them. This shouldn't come as a surprise; people have been making mistakes since long before automation came along, and they had to deal with them. You should definitely ask your domain experts and find these out, as they are tried and tested processes that form a core part of your domain.

But why is this mentioned as a benefit of immutable stores? Don't the same processes exist when working with mutable stores? In fact, with the tooling available in most modern databases, changing some values could be as easy as running a very short SQL update statement.

There are two primary reasons why correcting such errors is a much better experience with immutable stores:

  • You can use the same piece of data (a new event) to both correct the mistake and drive the corrective workflows. This behaviour is inherent in immutable stores.
  • Mutating live data is an operation rife with risk.

To expand a bit on the second point, quite frankly, I find doing any form of destructive transformation of live data in a public-facing environment terrifying. Unfortunately, I also have a couple of awful horror stories that I prefer to think of as "experience". Honestly, with an immutable store, fear of deployments is vastly reduced (well at least from the data perspective). I am sure for many people who went from SQL server migrations to using immutable stores, this is the thing they love the most. If anything goes wrong on a stream, the previous data still exist. Depending on the change, getting access to that data again may not even need any change in the store itself. Moreover, it's easy to keep around data that was created between deploying and rolling back, if that is useful.

TIP: I very strongly recommend that you ask your domain experts about corrective processes during collaborative workshops (like event storming and event modelling) or interviews, as these not only lead to new insights and valuable business processes but also allow you to have a pre-approved way of dealing with some classes of errors.

4. A proper audit log

An audit log does one thing: it answers the question, why did we make the decisions we made? You may need this to respond to requests from your users, or to meet legal obligations (which often exist in financial domains).

Note: In some domains it is important to keep in the audit log information that you would normally keep as part of your event metadata (the time of the action, the user taking the action, and so forth). In the past my preference has been to "promote" these to event properties, as this guarantees that they will not be modified or ignored as part of the normal processing that happens to event metadata within a complex system.
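
As an illustrative sketch (the event and property names are assumptions), promoting those fields could look like this:

```typescript
// Sketch of "promoting" audit-relevant metadata into the event body itself.
interface WithdrawalMade {
  type: "WithdrawalMade";
  accountId: string;
  amount: number;
  // Promoted from metadata: now part of the domain event itself, so it is
  // stored, versioned, and audited together with the rest of the data and
  // cannot be dropped by generic metadata handling.
  performedBy: string; // user taking the action
  performedAt: string; // time of the action
}
```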

It is easy to add a table to store actions and decisions as business logic makes them. If auditing is the only reason you were considering adopting event sourcing, you'll probably be OK with that additional table.

However, using immutable events in the context of event sourcing to store information gives you one guarantee that is essential in some domains and for some legal requirements: the same data you use for auditing is the data used to make all subsequent decisions, which guarantees that the audit data are accurate and correct.

With an immutable store, no one can tamper with the data to alter it. If a piece of data is missing, it cannot affect the decisions we make. With a separate audit log, you could emit more or less information due to bugs, which could get you into trouble. Many legal bodies are aware of domain-specific methods of storing data that behave like this, and this will help you immensely if the need for auditing arises.

5. Better user experience when errors get corrected

The first principle from Jakob Nielsen's principles for interaction design is visibility of system status. Some parts of this rule are:

  • Communicating the system's state to users, at least the elements relevant to them.
  • Providing feedback to users as quickly as possible, including feedback on their actions.
  • Building trust through continuous communication.

As an example, let's consider this: multiple users added credits to their accounts using coupons. However, due to a bug, we added twice as many credits to their accounts as we ought to. With mutable stores, we would simply update the value. With immutable stores, the old value is still available, along with a record of why and how we changed their credit amount. Which of these two provides a better experience?

Immutable data help you in all three points:

  • We can inform the user of all changes to their data, and these changes can be as visible as we need to make them. If the user saw the erroneous data before, we can expose functionality to allow them to trace and audit the changes made since they saw it, providing comfort.
  • We communicate system state more accurately and clearly, because we can provide a clear record of everything that caused a change to our users' data.
  • Not magically changing data is essential in building trust with our users.

How to deal with changes

As I mentioned above, we will inevitably make discoveries that lead us to want to change the existing system and potentially existing data. Previously I have outlined why immutability is a desired trait, but I wrote little about the 'how' of dealing with changes. Below I'll go over the most common reasons to change data in a system, and how to deal with them.

TIP: Buy and read Greg Young's book Versioning in an Event Sourced System. A lot of what I suggest below is covered in this book in greater detail.

Below, I mostly focus on dealing with changes locally. However, there is significant complexity in dealing with downstream consumers that depend on that data. You have to put careful thought into how you deal with dependents on your data.

NOTE: To help illustrate the point, I'll be using the bank account example, and I'll try to go through the possible changes we may need to apply in a sequential manner. If you have worked in banking, you may be aware that the stereotypical example of the event-sourced bank account is, in fact, entirely wrong. This is for multiple reasons, which aren't important here. I am, however, going to use this example because it is familiar to a lot of people, and it will allow us to focus on the aspect of making changes.

0. Happy-path editing of a value

All systems need to change some values due to external stimuli (a user changing some data, receiving new information that requires an entry to be updated, amongst other reasons). Taking an event store as a typical example of an immutable store, you would update a value by emitting a domain-specific event.

Let's take the stereotypical example of a bank account.

We could model the opening of a bank account using an AccountOpened event. We add this event to a new stream that will enforce concurrency on operations for that account, and we 'replay' this event by projecting the data it represents into an in-memory structure for use in domain logic. We can see this below:

[Diagram: changes-0-edit-a-value-1]

NOTE: I am showing the DepositMade event and event applier in the above diagram, but only because I assume that a minimum viable product won't be viable without at least the ability to make a deposit. In future examples, changes will be introduced only when needed to demonstrate the point.

When the account holder makes a deposit and we need to update the account balance, we could append a DepositMade event to the stream that includes information about how much money was deposited and what the balance of the account was after the deposit. When we load the account data from that stream, we project both events, again in memory, to reflect our most recent view. This would then look as follows:

[Diagram: changes-0-edit-a-value-2]
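
As a minimal, store-agnostic sketch (the event and state shapes below are assumptions; a real implementation would read the events from the account's stream), replaying the stream could look like this:

```typescript
// Assumed event shapes for the bank account example.
interface AccountOpened {
  type: "AccountOpened";
  accountId: string;
  owner: string;
}

interface DepositMade {
  type: "DepositMade";
  accountId: string;
  amount: number;
  balance: number; // balance after the deposit, as described above
}

type AccountEvent = AccountOpened | DepositMade;

interface AccountState {
  accountId: string;
  owner: string;
  balance: number;
}

// 'Replaying' the stream: fold every event into the in-memory structure
// that domain logic will then use.
function replay(events: AccountEvent[]): AccountState {
  let state: AccountState = { accountId: "", owner: "", balance: 0 };
  for (const e of events) {
    switch (e.type) {
      case "AccountOpened":
        state = { accountId: e.accountId, owner: e.owner, balance: 0 };
        break;
      case "DepositMade":
        state = { ...state, balance: e.balance };
        break;
    }
  }
  return state;
}

// Updating the balance is simply appending another event to the stream.
const stream: AccountEvent[] = [
  { type: "AccountOpened", accountId: "acc-1", owner: "Alice" },
  { type: "DepositMade", accountId: "acc-1", amount: 100, balance: 100 },
];
console.log(replay(stream)); // { accountId: 'acc-1', owner: 'Alice', balance: 100 }
```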

1. Capturing more data / capturing less data

Our bank account software works fine for a while, but we (very) soon discover that a bank account needs to be aware of the currency of the money it holds. So we need to add it.

As with the bank account, as a system's behaviour expands, it is only natural that we would want to capture more data than before. Alternatively, we may find that we no longer desire to keep some data or property in our system. Both changes are much easier to do using an immutable event store compared to mutating data:

  • For capturing new data, you can introduce new properties in the event or a new event entirely in software, providing a reasonable default for the property for use by pre-existing events.
  • To capture less data, you can simply remove some of the properties of an event, or stop emitting and projecting the event entirely.

Neither of the above requires us to change any existing data in the store at all, with all changes done in application code where we can test it conveniently and thoroughly. Perhaps more valuable is that, assuming reasonable defaults exist, we can apply these changes historically, and quickly and safely revert them if needed. The only thing required is to use weak schema formats (like JSON).

To do this in our bank account software, we'd add a property to the events that need it, providing the default in some reasonable way for the language we may be using. We can see this below:

[Diagram: changes-1-add-properties-1]

And once the account owner makes another deposit, it may look like the below:

[Diagram: changes-1-add-properties-2]

As you can see, all of these changes have been made purely outside of our database, in a very safe, backwards compatible, and (perhaps more importantly) testable manner.
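
To make this concrete, here is a minimal sketch of providing a default when deserialising older events, assuming JSON as the weak schema format. The currency property and the "GBP" default are illustrative assumptions, not a recommendation:

```typescript
// DepositMade with the new (assumed) currency property.
interface DepositMadeV2 {
  type: "DepositMade";
  accountId: string;
  amount: number;
  balance: number;
  currency: string; // new property
}

// Older events in the store have no 'currency' field; JSON being a weak
// schema, we simply fall back to a reasonable default while deserialising.
function deserialiseDepositMade(json: string): DepositMadeV2 {
  const raw = JSON.parse(json);
  return {
    type: "DepositMade",
    accountId: raw.accountId,
    amount: raw.amount,
    balance: raw.balance,
    currency: raw.currency ?? "GBP", // default for pre-existing events
  };
}
```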

2. Introducing significant structural changes to events

Our system seems to work really well, and more people start using it, with more accounts being opened. But then we find out that some people have problems signing up. What happened in our case is that our system doesn't cater particularly well to people whose name doesn't adhere to the name/surname structure (mononymous people, for example). Our domain experts decide that the best, and more inclusive, way forward is to instead ask for a full name. This means we will no longer have name and surname in our events (or accounts), but instead a single full name property.

The change I just described is one example of a change for which deserialisation (even using weak schemas) isn't enough. From experience, this type of change typically occurs when our understanding of the domain evolves, or when new requirements introduce significant changes to our domain. In this case, the shape of the event changes, but it remains the same event semantically.

You can make this change by introducing a new version of the event and relying on upcasting or parsing when projecting old versions of the event; as you load an old version of the event, you upconvert (or upcast) it to the new version through a small piece of code, before passing it to the projection logic. Upcasting is much easier than it sounds, frequently involving simple mapping.
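
A minimal sketch of such an upcaster, with assumed event shapes and a simple explicit version field (real systems often infer the version from the event type name or metadata instead):

```typescript
// V1 captured name and surname separately; V2 captures a single fullName.
interface AccountOpenedV1 {
  version: 1;
  accountId: string;
  name: string;
  surname: string;
}

interface AccountOpenedV2 {
  version: 2;
  accountId: string;
  fullName: string;
}

// Old events are upconverted in memory, just before projection;
// nothing stored in the event store is touched.
function upcastAccountOpened(event: AccountOpenedV1 | AccountOpenedV2): AccountOpenedV2 {
  if (event.version === 2) return event;
  return {
    version: 2,
    accountId: event.accountId,
    fullName: `${event.name} ${event.surname}`.trim(),
  };
}
```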

So after making this change in our account software, it would look like this: 

[Diagram: changes-2-merge-property]

All the above happens exclusively in application code. Again, for emphasis: no DB migrations, no modifications to tables, in fact no destructive changes to any data at all, hooray!

Finally, keep in mind that in some cases, a parser may be a better option than upcasting. The parser would receive the raw serialised format of any version of an event and directly parse it to an in-memory structure before replaying the event.

3. Changing the events

In some cases, you will find you need to make some semantic changes to the events. These come in three types:

  • Splitting events
  • Merging events
  • Moving properties across events.

Note: while these may seem to be event-store-specific, they still manifest in other types of databases in the form of moving or renaming columns, or restructuring tables.

You have a couple of options here:

  • Keep the old events and introduce new events to deal with the changes
  • Make these changes during a copy-replace.

Keeping the old events and introducing new ones means that you still keep the projection logic that projects those (now obsolete) events alongside the logic for the new ones. Business logic should stop raising the old types of events, and instead only raise the new ones.

It is important to note that retaining the old events carries the benefits we already visited in earlier points, including safety, and is therefore recommended. However, keeping those events means that you also need to keep the code that deals with them; if such changes happen enough, you may end up with noise in your code, both for projection logic and the obsolete events themselves. When this happens, a clean-up is in order. That can happen during a copy-replace. A copy-replace is a process where we copy events from one event store to a new event store, at the same time making all the necessary changes.

For an example, we'll leave our bank account for a bit and look at a loan application. We'll start with a stream that has events indicating that the loan has been requested and underwritten, with its scheduled repayments made, and with the interest and principal paid in full. After some time working on this product, we have come to a better understanding of our domain.

More specifically, our understanding of LoanRequested has evolved such that we now recognise it as two separate and distinct domain events: LoanRequested and a separate ScheduleCalculated that represents the proposed repayment schedule, which used to be part of the old LoanRequested. We can use event migration, as seen below, to read this event from one stream and emit two different events to an output stream with the same name in a new instance of an event store (similar to a blue-green deployment process):

[Diagram: immutability-dealing-with-change-1]

Since we are already doing an event migration, it makes sense to also update some of the events in our stream to the latest version that we understand. This will improve the performance of rehydration and allow us to drop upcasting logic from our solution. Below, the event migration reads one event at a time and emits a corresponding event on the output stream:

[Diagram: immutability-dealing-with-change-2]

Finally, we have realised that it makes little sense for our product to record separate InterestRepaid and PrincipleRepaid events, as these are almost always repaid in full together with the last repayment. To do this, we have logic in the event migration project that recognises those events and reads ahead to find the next one, before projecting both onto a new LoanRepaid event:

[Diagram: immutability-dealing-with-change-3]

In the above example, we have an event for LoanRequested still on V1, but the current version of our system uses V3. We're currently dealing with this using the method described above, but now we have decided that we want to split some semantics out into their own event. Specifically, there are processes that require the schedule of payments to be made (which is something we calculate based on the loan request) but aren't interested in anything else from the loan request, so we decided to have it separate from the loan request itself. Also, we decided to upconvert the LoanUnderwritten event and store the upconverted event, because we want to be free of the logic to upconvert events we no longer emit, and to improve performance. Finally, we decided to merge the PrincipleRepaid and InterestRepaid events into a single LoanRepaid event.
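
A minimal sketch of what the copy-replace migration loop could look like, showing only the split and the merge (the event shapes and the in-memory stand-in for the store are assumptions; a real migration would read from and append to actual streams, and the LoanUnderwritten upconversion is left out for brevity):

```typescript
// Assumed shapes of the old events as they sit in the source store.
type OldLoanEvent =
  | { type: "LoanRequested"; loanId: string; amount: number; schedule: string[] }
  | { type: "LoanUnderwritten"; loanId: string; approvedBy: string }
  | { type: "PrincipleRepaid"; loanId: string; amount: number }
  | { type: "InterestRepaid"; loanId: string; amount: number };

// Assumed shapes of the events we want in the new store.
type NewLoanEvent =
  | { type: "LoanRequested"; loanId: string; amount: number }                    // split: schedule removed
  | { type: "ScheduleCalculated"; loanId: string; schedule: string[] }           // split: new event
  | { type: "LoanUnderwritten"; loanId: string; approvedBy: string }
  | { type: "LoanRepaid"; loanId: string; principle: number; interest: number }; // merge

function migrateLoanStream(oldStream: OldLoanEvent[]): NewLoanEvent[] {
  const out: NewLoanEvent[] = [];
  for (let i = 0; i < oldStream.length; i++) {
    const e = oldStream[i];
    switch (e.type) {
      case "LoanRequested":
        // Split one old event into two new ones.
        out.push({ type: "LoanRequested", loanId: e.loanId, amount: e.amount });
        out.push({ type: "ScheduleCalculated", loanId: e.loanId, schedule: e.schedule });
        break;
      case "PrincipleRepaid": {
        // Merge: read ahead for the matching InterestRepaid and emit one LoanRepaid.
        const next = oldStream[i + 1];
        if (next && next.type === "InterestRepaid") {
          out.push({ type: "LoanRepaid", loanId: e.loanId, principle: e.amount, interest: next.amount });
          i++; // we consumed the InterestRepaid as part of the merge
        } else {
          out.push({ type: "LoanRepaid", loanId: e.loanId, principle: e.amount, interest: 0 });
        }
        break;
      }
      case "InterestRepaid":
        // A stray InterestRepaid without a preceding PrincipleRepaid.
        out.push({ type: "LoanRepaid", loanId: e.loanId, principle: 0, interest: e.amount });
        break;
      default:
        out.push(e); // everything else is copied across unchanged
    }
  }
  return out;
}
```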

The process has some nuances if we have a lot of events, and we want to do a zero-downtime deployment, but the fundamental logic is the same.

Note: Splitting, merging or moving properties of events using a copy-replace is not destructive (we're not supposed to remove data during this time), but it is more dangerous than the changes we've seen previously. If you have a bug in the code that does the copy-replace, you may end up with corrupted or missing data. However, the process requires you to move data from one server to the other in a blue-green approach, so you have a live backup you can fail over to if things go wrong.

Note: Even though this is non-destructive from a data perspective, in some industries, in order to adhere to auditing regulations you may need to keep the old events around, or may not be able to use copy-replace at all. Please check with someone who knows the laws of your industry.

4. Errors in data

Errors in data manifest either as raising the wrong events, or as raised events containing erroneous data.

Commonly you'd choose one of three ways to fix this type of issue:

  • Emit a compensation event
  • Revert and redo
  • Follow a process to resolve the issue.

It is important to note that in all the cases mentioned below, the effort involved in actually changing the data is similar or more straightforward when we take an immutable approach. With immutability, we have to emit one or two new events; with mutability, we could need anything from updating a cell in some rows to deleting and changing multiple rows in multiple tables.

Any processes needed in downstream consumers will still need to be kicked off, and any external side-effects allowed to happen; signalling this needs to occur with both mutable and immutable data stores.

4.1. Emit a compensating event

One option is to emit a compensating event, sometimes referred to as a partial reversal event. A compensating event is what it sounds like: you emit an additional event that has the opposite effect to the wrong one, with just the right values to balance out the error.

To demonstrate this, let's go back to the bank account example. Let's assume we have introduced a bug in our domain logic that ignores decimals during a deposit, so that it considers £10.00 to be £1000:

[Diagram: immutability-dealing-with-change-4]

Using a compensating event, we would emit the logical opposite of a deposit, in this case the withdrawal seen on the right, to compensate and bring our state back to the correct one. Downstream consumers can observe the withdrawal (made more visible through metadata) and correct their local state.

[Diagram: immutability-dealing-with-change-5]

As long as your domain allows for this to happen (an opposite action exists to fix the wrong one, and it's permitted legally) a compensating event can fix your error relatively quickly. Always make sure you add metadata to the event to indicate it is a compensating event for your audit trail if you need it.
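
A minimal sketch of what the compensating event could look like, with assumed shapes and metadata fields (the 'compensates' field and the referenced event id are hypothetical):

```typescript
// The erroneous deposit already in the stream (amounts from the example above).
const wrongDeposit = {
  type: "DepositMade",
  accountId: "acc-1",
  amount: 1000, // should have been 10.00
  balance: 1000,
};

// The logical opposite, for the difference only, flagged as a compensation
// in metadata so the audit trail explains why it exists.
const compensation = {
  type: "WithdrawalMade",
  accountId: "acc-1",
  amount: 990,
  balance: 10,
  metadata: {
    compensates: "<event id of the wrong deposit>", // hypothetical field
    reason: "Deposit recorded without decimals due to a bug",
  },
};
```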

Bear in mind that compensating events may be a quick and easy way to deal with errors, but your history no longer represents reality unless you inspect metadata, and that can be a problem for downstream consumers. This is quite important in some domains. In fact, I am willing to bet good money that this is not something banks would allow you to do. However, this method still has its uses in a lot of domains, or in generic services that don't have strong audit requirements.

4.2. Revert and redo

Sometimes a bug may have caused us to raise an event when we shouldn't, or the event had the wrong data in it. In both of these cases, we need to declare to downstream consumers that we made a mistake and let them deal with it accordingly. To do this, we can:

  • Emit an event to fully reverse the wrong event, often referred to as a full reversal event. A reversal event should include metadata to correlate it with the event it's reverting.
  • Emit a new, correct event.

Let's again go back to our decimal problem from before:

[Diagram: immutability-dealing-with-change-6]

With this method, fixing the deposit bug described earlier could be visualised as below:

[Diagram: immutability-dealing-with-change-7]

Emitting a full reversal event, aside from having the benefit of being informative, also explicitly notifies all downstream consumers about the mistake and allows them to act accordingly to correct their state. Where downstream systems process events using a pull-based model, it is even possible to automate some of this corrective process: for example, on receiving the reversal event, a query projection could drop the existing data it holds and rebuild its state from the beginning, skipping the 'undone' event.
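
A minimal sketch of the two events involved, and of a projection that understands the reversal explicitly (shapes, metadata fields, and amounts are assumptions carried over from the example above):

```typescript
// Fully reverse the wrong deposit, correlating the reversal to it via metadata.
const reversal = {
  type: "DepositReversed",
  accountId: "acc-1",
  amount: 1000,
  metadata: { reverses: "<event id of the wrong deposit>" }, // hypothetical field
};

// ...then emit the correct event.
const correctedDeposit = {
  type: "DepositMade",
  accountId: "acc-1",
  amount: 10,
  balance: 10,
};

// A simple balance projection that reacts to the reversal explicitly.
type BalanceEvent =
  | { type: "DepositMade"; amount: number }
  | { type: "DepositReversed"; amount: number };

function projectBalance(events: BalanceEvent[]): number {
  return events.reduce(
    (balance, e) => (e.type === "DepositMade" ? balance + e.amount : balance - e.amount),
    0
  );
}
```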

While important errors in data are usually caught early, sometimes they remain unnoticed for a while, long enough for subsequent actions to be taken. This is not a problem for commutative operations, and my experience has shown that even for non-commutative operations a revert-redo process will most often work. If not, another of the options presented will.

4.3. Follow a domain-specific process to fix the issue

In some cases, customer-facing errors are quite complex to fix. For example, if we wrongly decided that a loan has been fully repaid, we cannot just remove the flag and wash our hands of it. That does not necessarily mean the business just lost money, of course, but it does mean that the process to correct these mistakes is more complex than updating a value.

By now I have quite a lot of examples of error recovery workflows (I have made my fair share of mistakes), which exist either for legal reasons or because the people working in the domain developed and established strategies to fix complex mistakes. Whenever you discover errors in workflows core to your business and domain, I'm willing to bet that your domain experts know of a procedure to follow to fix them.

4.4. BONUS: What NOT to do

Don't emit an additional event with the correct values, without an undo event, relying on idempotency to fix your projections. Not having to raise an undo event may seem like less effort, but your projections (on the read or write path) may not stay idempotent for as long as the events exist. You may face some nasty surprises in the future if you take this approach. Moreover, this way the events in the stream don't represent reality.

Some event stores allow you to delete events selectively (deleting the third event while leaving the first two in place). Deleting events, even wrong ones, is also something I would advise you to avoid. Deleting a wrong event and adding a correct one is the closest you can get to mutating data in an immutable store, and it will cause you to lose most of the benefits of using one. Moreover, if you delete an event while newer events exist after it, you may end up with insidious concurrency issues that will be challenging to fix now that you've deleted the original event.

5. Wrong transactional boundaries

Changing the transactional boundaries is the one change that is commonly more difficult to address with an event store. While the transactional boundary may not have a representation in software, it does have one in the store: the event stream.

Note: Only some mutable stores allow for fine-grained control of transactional boundaries, and this operation is easier only in those cases. Typically these would be normal form databases, where such control is built in; there, the only thing needed is to change the query to act on or return a different view of the stored data. Reshaping documents in document stores, on the other hand, to accommodate new transactional requirements carries an equal (if not greater) amount of effort to reshaping streams.

In an event store, when we need to change our transactional boundaries, or significantly reshape the domain model we have, the only option is moving events to new streams or to already existing ones. We can do this using the copy-replace method described above, but the logic that copies events over will probably need to be more complicated than what we explained before, to deal with multiple streams.
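
A minimal sketch of the stream-reshaping loop: read every event from the source store and route each one to its new stream in the target store. The routing rule (moving ScheduleCalculated events to their own per-loan stream) and the append callback are purely illustrative assumptions:

```typescript
interface StoredEvent {
  streamId: string;
  type: string;
  data: { loanId?: string };
}

// Hypothetical routing rule: ScheduleCalculated events move to their own
// per-loan stream; everything else stays on its original stream.
function targetStream(e: StoredEvent): string {
  if (e.type === "ScheduleCalculated") return `schedule-${e.data.loanId ?? "unknown"}`;
  return e.streamId;
}

// Copy-replace across streams: the append callback stands in for writing
// to the new event store instance.
function reshape(source: StoredEvent[], append: (streamId: string, e: StoredEvent) => void): void {
  for (const e of source) {
    const stream = targetStream(e);
    append(stream, { ...e, streamId: stream });
  }
}
```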

TIP: If you find you also need to split, merge, or move a property from one event to the other in addition to reshaping transactional boundaries, I strongly recommend that you:

  • Split events before reshaping streams
  • Merge events after reshaping streams
  • Move the property to other events only when the events are in the same stream.

Do the above in a separate deployment from reshaping the streams, and allow time to make sure it works before proceeding with the reshaping of the streams.

TIP: From experience, this type of problem happens more often when the domain is new to the company, and quite rarely when the domain is mature with considerable in-house expertise. I would suggest you either avoid event sourcing if the domain is new with significant uncertainty and you need to experiment, or simply be prepared to make these changes as you discover them, to allow your model to reflect your understanding of the domain.

Conclusion

I hope you now see that using immutability for storing data that supports business decisions has advantages compared to mutating data. I also hope I answered some of your questions about how to deal with domain discoveries, errors, and new requirements that require changes to your data. If you have any questions or comments, feel free to reach out to me on Twitter.



Savvas Kleanthous is the head of engineering for ParcelVision Ltd, and has worked on highly scalable, high-performance, evolvable systems in industries such as financial services, telecoms, entertainment, gaming, health and transportation. Through different positions as tech lead and architect (and sometimes both), he has helped companies solve complex problems by applying process changes, collaborative discovery, design workshops, architecture, and implementation of sociotechnical systems. Savvas has significant experience with DDD, CQRS, and Event Sourcing and has delivered and maintained such solutions in production.