Software engineers, technical leads and managers all share one goal – shipping high-quality software on time. Ambiguous requirements, strict deadlines and technical debt exert conflicting tugs on a software team’s priorities. Software quality has to stay high; otherwise bugs inundate the team, further slowing delivery.
This post proposes a model for consistently shipping high-quality software. It also provides a common vocabulary for communication across teams and people.
This framework is the culmination of lessons learnt delivering the most challenging project I have ever worked on. The task was to make a web application globally available to meet scaling and compliance requirements.
The one-line goal quickly ballooned into a multi-month effort requiring:
- Moving from a single compute resource to multiple compute resources.
- Fundamental changes to platform-level components across all microservices.
- Constantly collaborating with diverse teams to gain more insight into their systems.
The icing on the cake? All critical deployments had to be seamless and not cause a service outage.
What’s Donald Rumsfeld gotta do with software?
He’s not a software engineer, but his quote below provides the basis for this model.
There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.
– Donald Rumsfeld
His quote is a simplified version of the Johari window from psychology. Applied to software, the window looks like this:
| | What the developer knows | What the developer doesn’t know |
| --- | --- | --- |
| What other developers know | Known | Unknown known |
| What other developers don’t know | Known unknown | Unknown unknown |
1. The known
Feature requirements, bugs, customer requests etc. These are the concepts that are well known and expected. However, writing code to implement a feature does not guarantee full known status. For example, untested code is still a known unknown until you verify how it behaves.
It is one thing to believe code works as you expect; it is another to prove it. Unit tests, functional tests and even manually stepping through every line all help to increase the known.
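As a trivial illustration, here is a hypothetical pricing function with unit tests that pin down its rounding behaviour. The function and names are invented for this sketch; the point is that each passing test converts an assumption into a known:

```python
# A minimal sketch: a hypothetical discount function plus unit tests
# that turn "I think it rounds down" into proven, known behaviour.
import unittest

def apply_discount(price_cents: int, percent: int) -> int:
    """Return the discounted price in cents, rounding down."""
    return price_cents * (100 - percent) // 100

class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(1000, 25), 750)

    def test_rounding_is_down(self):
        # 999 * 0.90 = 899.1; integer division rounds down to 899.
        self.assertEqual(apply_discount(999, 10), 899)

    def test_zero_discount_is_identity(self):
        self.assertEqual(apply_discount(1234, 0), 1234)

if __name__ == "__main__":
    unittest.main()
```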
2. The known unknown and the unknown known
I am collapsing both halves into one group because they are related.
The known unknown: aspects that the developer knows about but engineers in partner teams don’t. A good example is creating a replacement API and deprecating the existing one. Another is changing the behaviour of a shared component.
The unknown known: aspects that the developer doesn’t know about but engineers in other teams do. For example, a seemingly minor update to a core component can trigger expensive rewrite cascades in partner teams. Another example is quirks known to only a few engineers.
Clear communication is the best fix for challenges in this category. Over-communicate! Send out emails, hold design reviews and continuously engage stakeholders.
This is extra important for changes with far-reaching impact. As the developer/lead/manager, you need to spend time with the key folks and understand their scenarios deeply. This leads to better mental models and helps you forecast issues before they arise.
Finally, this applies to customers too – you may know what the customer doesn’t know about and vice versa.
3. The unknown unknowns
This is the most challenging category. There is no way to model or prepare for something unpredictable – an event that has never happened before. Unknown Unknowns (UUs) include hacks, data loss / corruption, theft, sabotage, release bugs and so on.
Don’t fret just yet – the impact of UUs can be mitigated. Let’s take two metrics:
Mean time to repair (MTTR)
The average amount of time it takes to repair an issue with the software.
Mean time to detect (MTTD)
The average amount of time it takes to detect a flaw.
The most reliable way of limiting the impact of UUs is to keep the MTTR and MTTD low. Compare the damage that a data-corrupting deployment can cause in 5 minutes versus 1 hour.
A rich monitoring and telemetry system is essential for lowering MTTD. Log system health metrics (RAM, CPU, disk I/O etc.), HTTP response statuses (500s, 400s etc.) and more.
Ideally, a bad release will trigger alarms and notify administrators as soon as it goes out, enabling the service owner to react and recover quickly.
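As a sketch of the idea (not a production design – real systems would use a metrics stack such as Prometheus), the snippet below counts HTTP response statuses and fires a hypothetical `notify_oncall()` hook when the 5xx rate crosses a threshold. The threshold, bucketing scheme and hook are all illustrative assumptions:

```python
# A minimal telemetry-and-alarm sketch. The threshold, bucket scheme and
# notify_oncall() hook are invented for illustration, not a real API.
from collections import Counter

ERROR_RATE_THRESHOLD = 0.05   # alert once more than 5% of requests fail
MIN_SAMPLE_SIZE = 100         # avoid alarming on a handful of requests

def notify_oncall(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a paging/e-mail integration

class RequestTelemetry:
    def __init__(self) -> None:
        self.status_counts = Counter()

    def record(self, status_code: int) -> None:
        """Log every completed HTTP request, bucketed as 2xx/4xx/5xx."""
        self.status_counts[status_code // 100] += 1
        self._check_alarm()

    def _check_alarm(self) -> None:
        total = sum(self.status_counts.values())
        if total < MIN_SAMPLE_SIZE:
            return
        error_rate = self.status_counts[5] / total
        if error_rate > ERROR_RATE_THRESHOLD:
            notify_oncall(f"5xx rate {error_rate:.1%} exceeds threshold")
```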
Having a feature toggle or flighting system helps lower MTTR. Using the bad-release example again, a flight/feature toggle lets you ‘turn off’ that feature before it causes irreparable damage.
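A minimal sketch of the toggle itself follows, assuming an in-memory flag store and invented handler names. A real system would back the flags with a config service or database so they can be flipped without redeploying:

```python
# A minimal feature-toggle sketch. FLAGS, the flag name and both billing
# flows are invented for illustration.
FLAGS = {"new_billing_api": False}  # ship dark, flip on when confident

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def new_billing_flow(order_id: str) -> str:
    return f"invoice for {order_id} via new API"     # risky new path

def legacy_billing_flow(order_id: str) -> str:
    return f"invoice for {order_id} via legacy API"  # known-good fallback

def handle_invoice(order_id: str) -> str:
    if is_enabled("new_billing_api"):
        return new_billing_flow(order_id)
    return legacy_billing_flow(order_id)

print(handle_invoice("order-1"))  # -> "invoice for order-1 via legacy API"

# Turning the feature off is one line, no redeploy, so MTTR stays low:
FLAGS["new_billing_api"] = False
```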
Also critical is a quick release pipeline: if it takes two days to get a fix out, then your MTTR is two days plus however long the fix itself takes. That’s a red flag – invest in a CI pipeline.
Suppose a software engineer is rolling out a critical core update. A few questions to ask:
- Is there enough logging to debug and track issues if they arise?
- Is the risky feature behind a flight or feature toggle? How soon can it be turned off if something goes wrong?
- Are there metrics that can be used to find out if something goes wrong after the feature is deployed in production?
One release strategy is to roll out the feature in a turned-off state, then turn it on for a few users and check that things are stable. If it fails, turn off the feature switch and fix the issue; otherwise, progressively roll out to more users.
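One way to implement the progressive part is deterministic bucketing: hash each user into a bucket from 0 to 99 and enable the feature for buckets below the current rollout percentage. The sketch below uses an invented feature name, and in practice the percentage would live in config rather than code:

```python
# A minimal progressive-rollout sketch using a deterministic hash, so a
# given user consistently stays in or out of the rollout for a feature.
import hashlib

ROLLOUT_PERCENT = 5  # start small; raise towards 100 as confidence grows

def in_rollout(user_id: str, feature: str) -> bool:
    """Deterministically map (feature, user) into a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < ROLLOUT_PERCENT

# The same user always gets the same answer for a given feature:
print(in_rollout("user-42", "new_billing_api"))
```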
What steps do you take to ensure software quality?