My foremost goal while building software is to build stable self-healing systems with deterministic behaviour. I want to ensure my code continues to work even when unexpected events occur. In the event of unknown unknowns, the expectation is a graceful degradation in the worst case.
This requirement is even more important when implementing high-risk projects. Some examples include moving users to new clusters, processing background jobs on unstable clusters or even moving data to new locations.
The overarching goal of stability and guaranteed behaviour is built on some fundamental sub-goals. These include:
- Zero customer impact – whenever possible a change should be transparent to customers. Changes should not impact usage of the software.
- Transactional changes – changes should be atomic and either succeed or fail.
- Resiliency – The change should be resilient to failures in the underlying stack (e.g. outages, clusters failures or cancelled operations). The system should be smart enough to self-heal without impacting users.
Some tasks might initially appear to be impossible without customer impact, however most times, with some effort and investigation; there are workarounds. For example, it is possible to do a live data migration by setting up two stores and slowly killing off the old store. The deciding factor as always should be the return on investment – is it worth creating engineering solutions to work around the customer impact?
The following steps highlight some techniques I employ to ensure my software works as expected. They leverage concepts from computer science and Mathematics.
Yup, same-old classic truth tables from Boolean logic. They are my favorite go-to tool when modelling new conditions and have saved me from having “Who would have thought that could happen?!” post-mortems. I learnt about Truth Tables a long time ago from Code Complete and have used it ever since – they help to reduce the unknown unknowns.
A truth table minimizes the chances of missing out on corner cases; the comprehensive analysis required quickly exposes the unexpected combination of input/events that might trigger ‘funny’ behaviour. Moreover, as an added benefit, the outcomes makes it possible to simplify logic using the logical operators; a quick example is using XOR to describe exclusivity conditions.
Let us take an example scenario involving two types of token-bearing requests: tokenA for paying users and tokenB for free users. You could attempt to mentally model all kinds of conditions and actions based off token types but putting it down in truth table will bring some strictness/rigidity that will help you avoid “hmm, I forgot about that possibility or I didn’t think about that” situations.
The truth table will even force you to think about conditions that you wouldn’t have thought about – what happens if a user passes in two tokens? What if he passes in an invalid token? For situations you don’t care about, you can mark them as ‘don’t care’ conditions.
|1||0||1||is B type|
|1||1||0||is A type|
|1||1||1||DON’T CARE? / How can this happen?|
The table shows a few don’t care conditions since those should not be possible in the code; but if they are you should handle them or at least log them.
For complex tables, you can employ techniques like the Karnaugh map or canonical sum of products to get a simplified expression. However, the risk is that you might end up with expressions that are valid but difficult to understand in code. In some domains (e.g. chip design) this may be fine but in software engineering, code legibility is a big win. So be careful.
Another technique would be using invariants to be sure that your code works as it should. Most of us are exposed to invariants in the classical looping algorithms or in assert statements however their applications are not limited to these two cases alone.
What I like most about invariants is that they make it possible to think about large systems by establishing isolation boundaries. An invariant condition guarantees a behavioural contract between system components and helps to provide correctness guarantees. Moreover, they provide a basis for reasoning about systems and abstracting away common patterns.
Say for example you had to do transactions on some arbitrary value (e.g. money, commodities etc.). A quick invariant is to always establish that the net change of all operations is always zero; i.e. the total value in the system stays constant.
As shown in the table above, the total value at the beginning of the transfers is the same as the total value at the end. Having this check quickly allows you to verify the validity of your code and identify bugs.
This same concept can be applied to data migration, sign ups or even routing users to clusters.
There is a difference between I think it is true because it works for x number of cases and it is true because I can prove it. I am not advocating for hard-core mathematics which might only be needed in certain scenarios (graphics, compilers etc.); however having a basic understanding of some mathematical concepts helps to make our jobs easier.
A good example of this is sets which are more ubiquitous than we care to admit; they provide a way to quickly build and validate software models. For example, finding users of a particular type is an intersection operation.
Another example is with queues, Little’s law comes in very handy when building queue-based processing systems. Quick back-of-envelope models help provide a reasonable ballpark figure for consumption, delivery and processing times.
What I like most about mathematical reasoning is that it provides a language for expressing concepts. This provides a solid foundation for thinking about even bigger abstractions in turn.
One area I would love to explore is applying concepts from sequential logic to software modelling. The hardware industry has been around for a long time and they have some excellent techniques for ‘memory’ and memory circuits.
I wonder how feasible it would be to model complicated scenarios using flip-flops and if such approaches will improve code readability and stability.
What other tools do you use to guarantee code correctness?