Techniques for Building Resilient Software Systems


Downtime = lost revenue.

One of the most challenging aspects of software development is staging changes without breaking the service. Releasing new features always comes with a risk – bugs might be introduced and existing failure points might become more prone to failure.

The task is even more daunting when you have many microservices with multiple changes being checked in: the number of possible states grows combinatorially.

The following sections discuss a few techniques I use. I learnt most of them the hard way, and they have worked well for me, especially when rolling out changes across multiple services with temporal dependencies and requirements.

1. Live configurations

Most of the time, configuration is specified at build time and consumed on startup. This is fine for most scenarios; however, static configuration files may be too rigid for dynamic use cases.

An example would be using configuration files to determine which clusters receive user traffic and which serve as backups. A configuration mistake (human error), scaling up or down to adjust to traffic, or a compute resource failure (it happens in distributed systems) each necessitates a configuration update. In turn, that would require a production release. Your Mean Time to Repair (MTTR) is affected by the amount of time it takes to complete all the updates.

An alternative is to have ‘live’ configuration that can be modified quickly. For example, you could read from a secure store. If things go wrong or you want to scale up, the fix is to update that configuration via an API or even a user interface.
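Here is a minimal sketch of the idea in Python. It is an illustration under assumptions, not the setup I actually ran: the `LiveConfig` class, the `fetch` callable (standing in for a read from your secure store), the `ttl_seconds` refresh interval and the `active_cluster` key are all hypothetical names.

```python
import time
from typing import Callable, Dict, Optional


class LiveConfig:
    """Re-reads configuration from a store at most once every `ttl_seconds`."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, str]],  # hypothetical: reads the secure store
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._values: Dict[str, str] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: str, default: Optional[str] = None) -> Optional[str]:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            try:
                # Copy so later changes in the store are only seen on refresh.
                self._values = dict(self._fetch())
                self._loaded_at = now
            except Exception:
                # Store unreachable: keep serving the last known values and
                # retry on the next call. With nothing loaded yet, fail loudly.
                if self._loaded_at is None:
                    raise
        return self._values.get(key, default)


# Usage: an in-memory dict stands in for the real store.
store = {"active_cluster": "east"}
config = LiveConfig(fetch=lambda: store, ttl_seconds=0)
before = config.get("active_cluster")
store["active_cluster"] = "west"  # the "single API call" that fixes things
after = config.get("active_cluster")
```

The refresh interval is the trade-off to tune: a shorter one means fixes land faster, at the cost of more reads against the store.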

I remember using this technique to fix a live site issue. I had forgotten to update a required configuration for a new feature. Fortunately, the feature was designed with ‘live’ configuration; otherwise the impact would have been worse. Ultimately, the fix was a single API call and voilà! Issue gone! MTTR? Less than a minute.

2. Fall backs

Using ‘hardcoded’ stubs to break in a new integration might help to ensure a smooth adoption while minimizing risk and not breaking the service. The major goal is that changes should be transparent to the users and things should not break.

I remember once trying to integrate a brand-new v1 service into our offering. The challenge was that the v1 service we were going to depend on was in flux and things were broken at times. I am not pointing fingers or blaming them; that is expected when you are building things from scratch.

Consequently, I had to mock the expected responses. Thankfully, the set of possible responses was quite small at the time: there were only two possible states. I hardcoded these responses into our configuration files and wrote the code so that it would read the configuration files and return the expected response.

To the end users it made no difference; the beauty of it was that it gave us a way to slowly switch to the service without any downtime. And when the v1 service did break, all we had to do was flip the flag / feature switch and we would fall back to the configuration without having to do a code deploy or spin up a live site incident. Eventually the service matured and the configuration fallback went away.
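A rough sketch of that switch in Python follows. Everything named here is hypothetical: the `CANNED_RESPONSES` table and its two states, the `use_live_service` flag reader and the `call_service` client. The `try`/`except` is an extra I am adding for illustration; in the story above the fallback was triggered by flipping the flag by hand.

```python
from typing import Callable, Dict

# Hypothetical canned responses, standing in for the values that were
# hardcoded in configuration files (only two states were possible).
CANNED_RESPONSES: Dict[str, str] = {"user-1": "ENABLED", "user-2": "DISABLED"}


def get_status(
    key: str,
    use_live_service: Callable[[], bool],  # reads the flag / feature switch
    call_service: Callable[[str], str],    # client for the new v1 service
    default: str = "DISABLED",
) -> str:
    """Return the live answer when the flag is on, else the canned one."""
    if use_live_service():
        try:
            return call_service(key)
        except Exception:
            # Assumption for this sketch: also fall back automatically
            # when the live call fails, instead of surfacing the error.
            pass
    return CANNED_RESPONSES.get(key, default)
```

If the flag itself lives in live configuration, flipping it needs no deploy, which is what makes the fallback cheap to use during an incident.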

3. API versioning and updates

Breaking changes are inevitable – the service has to grow and evolve and things have to change. Even if you don’t make breaking changes, some underlying major piece that you rely upon may be rewritten and require you to make changes (Angular1 vs Angular2, Object.observe, EmberJS). The business has to keep running – most times, there is a seamless path for upgrades.

I have had to make a lot of breaking changes in recent times: API changes deep down in the stack, shared library updates and even controller responses. Most times, versioning (especially semver) offers a way out. But what about the times when versioning is not enough?

For shared libraries, when possible, I prefer to make changes opt-in without changing signatures; this helps to avoid unknown unknowns. Consumers who want to upgrade would then have to opt-in to new behaviour while old consumers have to do nothing.
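One way to read this in Python is a keyword-only parameter with a default that preserves the old behaviour, so every existing call site stays valid. This is just a sketch; `format_name` and `last_name_first` are made-up names for illustration.

```python
def format_name(first: str, last: str, *, last_name_first: bool = False) -> str:
    """Old callers get the old output; new callers opt in by keyword."""
    if last_name_first:
        return f"{last}, {first}"
    return f"{first} {last}"


# An existing consumer: no code change needed.
old_style = format_name("Ada", "Lovelace")

# A consumer that explicitly opts in to the new behaviour.
new_style = format_name("Ada", "Lovelace", last_name_first=True)
```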

A potential example would be upgrading to the latest version of a core library. If the signature changes, every consumer would need to make code changes even if they don’t really want to yet – that exposes you to risk and increases the chances of misuse.

The worst scenarios involve ‘silent’ behavioural changes: updates that significantly change the behaviour of the code while retaining the same signature. These can come back to bite you months after the change, or even after the author has left the team.

For resiliency, I prefer sending back objects instead of primitives when possible. This allows updating the response with more fields and/or removing deprecated ones without having to change method response signatures. This isolation ensures there is a shared contract that both the producers and consumers can version and rely upon.
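As a small illustration, assuming a hypothetical quota lookup: returning a `QuotaResult` object instead of a bare `int` means a field can be added later without touching the method's return type. The names and values below are invented for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QuotaResult:
    remaining: int
    # Added in a later release. The default keeps code that constructs
    # or reads the older shape working without changes.
    resets_in_seconds: Optional[int] = None


def get_quota(user_id: str) -> QuotaResult:
    # Hypothetical lookup; a real implementation would query a store.
    return QuotaResult(remaining=42, resets_in_seconds=3600)


# An older consumer only reads the field it already knew about.
result = get_quota("user-1")
can_proceed = result.remaining > 0
```

Had `get_quota` returned an `int`, adding the reset time would have meant a new method or a changed return type for every consumer.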

4. Self-healing code

What happens if you paint yourself into a corner and can’t get out anymore? This happens more often than you think. The ‘live configuration’ technique described in section 1 works in some cases but is not a silver bullet.

Caches offer multiple real-life examples of this problem.

There are only two hard things in Computer Science: cache invalidation and naming things. — Phil Karlton

It is common practice to cache the results of expensive operations; these could be clients for connections, resource statuses or even ids. Let’s say you have a cache of values that you think would never change; what then happens when they do? For example, a cached client connection might become invalid due to an outage. A cached SQL connection string might also change due to configuration updates (password change, access mode updates etc.).

When such unexpected things happen, they might trigger cascading effects downstream. A remote service might start failing due to failing connections, queries might corrupt some essential fields and so on.

A simple fix is to build self-healing mechanisms into your caching code. A cache can either expose a ‘complaint’ API or self-validate internally.

    1. Complaints
      An example of this pattern would be Azure Service Fabric, which has a DNS-like resolution of services running on nodes. Because services can move to new addresses, it is possible to get invalid addresses from Service Fabric.
      When that happens, the consumer can ‘complain’ to Service Fabric, which forces it to update the internal cache of addresses. This is the 410 class of responses from Service Fabric.
    2. Self-Validation
      This is a step forward from the complaints method above: the cache validates its own contents before sending values back to consumers. This approach is great for inexpensive operations. An example would be a resource client that uses an internal cache for operations on some resource. Such clients can run scheduled cursory checks on the validity of cached values, or even add a safety check on lookup (provided the check is not expensive).
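Both ideas can be sketched in one small Python class. This is an illustration under assumptions, not how Service Fabric is implemented: `SelfHealingCache`, the `load` and `is_valid` callables, and the `complain` method are names I made up for the example.

```python
from typing import Callable, Dict, Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class SelfHealingCache(Generic[K, V]):
    def __init__(
        self,
        load: Callable[[K], V],                          # the expensive lookup
        is_valid: Optional[Callable[[V], bool]] = None,  # cheap validity check
    ) -> None:
        self._load = load
        self._is_valid = is_valid
        self._entries: Dict[K, V] = {}

    def get(self, key: K) -> V:
        if key in self._entries:
            value = self._entries[key]
            # Self-validation: check the cached value on lookup.
            if self._is_valid is None or self._is_valid(value):
                return value
            del self._entries[key]  # stale, fall through and reload
        value = self._load(key)
        self._entries[key] = value
        return value

    def complain(self, key: K) -> None:
        """Complaint API: a consumer reports that the value for `key` is bad."""
        self._entries.pop(key, None)


# Usage: a dict stands in for a name-to-address resolver.
addresses = {"orders": "10.0.0.1"}
cache = SelfHealingCache(load=lambda name: addresses[name])
first = cache.get("orders")
addresses["orders"] = "10.0.0.2"  # the service moved
stale = cache.get("orders")       # still the old, cached address
cache.complain("orders")          # the consumer noticed the failure
fresh = cache.get("orders")       # reloaded from the resolver
```

Validating on every lookup only makes sense when `is_valid` is much cheaper than `load`; otherwise a scheduled check or the complaint path is the better fit.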

With these techniques in place, much of the pain goes away and you can rely on your code ‘self-healing’ when unexpected things happen.

Conclusion

What tricks do you use to get code in safely without causing user impact or downtime? Share some.

Related

If you enjoyed this article, you may like the following:

  1. A framework for shipping high quality software
  2. Creating Great User Experiences
  3. Things to check before releasing your web application
