As all of us who have released production software know, if something rare and never-before-seen is going to happen, it almost always happens on the day of release. In our shared career histories, we’ve personally seen many such examples: routers going bad, disks failing, servers with bad memory, admins blocking the ports you use on the firewalls, admins switching from two-phase to three-phase commit on the DTC (why did our latency increase by 40%?!), and of course we all love dealing with multiple levels of managed providers. Literally, we have seen all sorts of fun stuff (and we’d love to hear your stories on Twitter: @eventstore)!
Yesterday we pushed the source for Event Store 2.0.0 so that early adopters can start looking through it, testing with it, and using it. We are currently working on making sure the binaries are stable before releasing them along with the NuGet packages. We also released the Event Store website. As we discussed yesterday, this is one of our clusters that we have automated for power failures/network partitions. It does about 1,000 power failures per day to help make sure you don’t have a problem when/if you happen to lose power.
At about 0300 local time last night we ended up losing a node. The data hard drive actually went bad. Who would have thought such a thing would happen on the day you release?! Windows was not so happy either: the box entered a zombie state. Once we were able to get it to boot up, about all we could get it to do was…
…telling us that the disk is reporting device errors. Uh oh…
However, the cluster continued running throughout the night (it actually processed another 3GB of writes before we got to it). The power outage test stopped pulling power because, with one node already down, doing so would have broken the quorum, which is what the replication model is based on. We have other tests that actually break the quorum on purpose.
This is the reason we have clustering. A single catastrophic failure did not bring down the cluster. A five-node cluster can handle two catastrophic failures. Remember that you will never run out of things that can go wrong!
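To make the arithmetic behind those numbers concrete, here is a minimal sketch in Python. It assumes a simple majority quorum (more than half of the nodes must be reachable), which is consistent with the five-node figure above but is our simplification rather than a description of the Event Store internals. The function names, including `safe_to_pull_power`, are hypothetical and exist only for illustration.

```python
def quorum_size(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority (assumed quorum rule)."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - quorum_size(nodes)


def has_quorum(nodes: int, failed: int) -> bool:
    """True if the surviving nodes still form a majority."""
    return nodes - failed >= quorum_size(nodes)


def safe_to_pull_power(nodes: int, already_down: int) -> bool:
    """Hypothetical guard for a power-failure test: only take one more
    node down if the cluster would keep its quorum afterwards."""
    return has_quorum(nodes, already_down + 1)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"{n} nodes: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption, a three-node cluster that has already lost one node cannot afford to lose another, so a guard like `safe_to_pull_power(3, 1)` returns `False`. That is the kind of check that would explain a power-failure test pausing itself once a node is already down.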
When we came in in the morning, we replaced the hard drive and turned the appliance back on. It took about 15–20 minutes to get fully synchronized again, but it then happily rejoined the cluster and the power-off test resumed.
As for the busted hard drive, we have a feeling that pulling the power on it a few hundred times a day for weeks on end may not have been the intended use, but luckily, since we’re using an Event Store appliance, the send-home warranty covers it. Even more luckily, it does not have far to travel!