When sleeping dogs bite: Unmaintained systems breed disasters


Let sleeping dogs lie – avoid interfering in a situation that is currently causing no problems but might do so as a result of such interference.

Why sleeping dogs should make you anxious

There is a general belief that legacy systems, stale code, and archaic tooling are sleeping dogs – we avoid them fearing a collapse from interference or deprioritize them for seemingly more important projects. I think this strategy of intentional avoidance and/or deprioritization is a recipe for disaster.

The challenge with systems that do not fail is that they have no known fixes when they eventually do fail. Thus, it is more urgent to fix these systems than is usually accepted.

The False Sense of Security

Imagine moving into your dream house; to your dismay, you discover several vermin-infested areas. You have two options for dealing with this unpleasant discovery:

  1. Hire an exterminator
  2. Barricade the vermin-infested areas

Option                          | Pros                       | Cons
Hire an exterminator            | Permanent fix              | Expensive – costs time, money, and effort
Barricade vermin-infested areas | Quick, cheap, and easy fix | Temporary fix, vermin can spread out of control

The sub-optimal fix: Barricading

You choose to cordon off the vermin-infested areas and warn your kids about venturing into off-limit areas. Barricades are not foolproof though; it is only a matter of time before the bugs spread.

Every once in a while, a cockroach squirms out and everyone scrambles to stop it before it spawns a new infestation. It becomes a never-ending game of whack-a-mole: identifying and plugging new holes. Soon enough, you start wondering why you bought the house – even exterminators won’t come near the house without charging an arm and a leg.

The better fix: Hire experienced exterminators

A better strategy would be to get experienced exterminators who go from room to room, clearing and fumigating the house. This might take time, cost money, and unearth disruptive discoveries; however, this is the only safe way to guarantee a bug-free home.

How a legacy system lulled my team into a false sense of security

I learned about the perils of barricades the hard way a couple of years ago. I was the tech lead for the service foundations crew; my team was responsible for the core services and infrastructure powering the entire team. This platform role required an eclectic mix of software engineering, DevOps, and SRE-type duties.

Originally, there was an org-wide platform team that owned all the infra and common framework libraries. However, that team got disbanded after episode 59787 of The Yearly Reorg™ series. My product group had to take up ownership since we owned more than 70% of the services on the orphaned platform.

The platform team dropped a bombshell during the transition: there were 3 services that hadn’t been touched in eons. These services worked fine; however, no one knew how they worked, what to worry about, or where to look for solutions. We took ownership and hoped that the dogs would continue sleeping; after all, we needed to focus on the higher priority projects.

Two years later, a tarantula crawled out…

The invisible platform cracks

It started with reports of missing logs on our dev cluster; whole swathes were missing in request-response logs which made it impossible to debug code. A few days later, we rolled out some platform fixes and lost all logs from our canary cluster.

This observability loss crippled the team and showed no signs of relenting – more tarantulas kept emerging and we needed to rapidly contain the infestation. Out of the window went all planned work.

The symptom was always the same: a healthy monitoring app (MA) but zero logs. Unfortunately, no one knew how to debug the MA service; heck, we didn’t even know how it worked under the covers!
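A healthy process with zero logs is the kind of failure that a plain health check misses. One way to catch it earlier is to watch the log stream itself for silent stretches. The sketch below is illustrative only and is not how the MA worked: the function name, the five-minute threshold, and the sample timestamps are all assumptions.

```python
from datetime import datetime, timedelta


def find_log_gaps(timestamps, max_silence):
    """Return (start, end) pairs where consecutive log entries
    are further apart than max_silence."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_silence:
            gaps.append((earlier, later))
    return gaps


# Hypothetical sample: entries at minutes 0, 1, 2, then nothing until 30.
base = datetime(2024, 1, 1, 12, 0)
sample = [base + timedelta(minutes=m) for m in (0, 1, 2, 30, 31)]

# Assumed threshold: more than five minutes of silence is suspicious.
gaps = find_log_gaps(sample, timedelta(minutes=5))
for start, end in gaps:
    print(f"no logs between {start:%H:%M} and {end:%H:%M}")
```

A check like this only reports that logs stopped, not why; it would have surfaced the symptom sooner but not the restart-related cause.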

The log loss occurred whenever a node restarted. After inspecting the machine logs, we noticed that patches and upgrades were triggering restarts, so we explored preventing restarts as a short-term fix. We abandoned this approach since it was tricky figuring out how to reliably turn off restarts. Furthermore, it exposed us to considerable security risks.

Redemption

The last resort was to cut the Gordian Knot – start out afresh and deploy a brand new monitoring app. Starting from scratch was facilitated by our top-class engineering systems; these multipliers removed most of the stumbling blocks to value delivery. For example, we could release a brand new app across 13 regions in 60 minutes if needed; without this capability, this approach would have run into serious roadblocks.

The biggest challenge was building and testing the new version of the MA. Outdated docs sent me down the wrong path initially and consulting internal experts provided no relief – it was one of those systems that just worked and lured everyone into assuming they knew how it worked under the covers – a classic case of Dunning-Kruger. Finally, I got help from the internal team that built the MA itself.

I then provisioned the required resources (e.g. log accounts, Kusto clusters, etc.), set up data processing pipelines, and finally linked the new accounts to the new MA version. Another challenge emerged after the changes were deployed to the internal cluster: I discovered that the new and old versions of the MA could not run side-by-side. There was no way to avoid downtime during rollout since the old service had to be removed before deploying the new one.

This downtime implication necessitated a weekend rollout – the easiest way to minimize the impact of log loss during swaps. We completed the global rollout on a Saturday (1-hour intervals for each cluster); by Monday, everything was back to normal.
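A staggered rollout like this is easy to plan in a few lines. The sketch below is a minimal illustration, not the tooling we used: the cluster names, the start time, and the `staggered_schedule` helper are hypothetical, and only the one-hour spacing comes from the rollout described above.

```python
from datetime import datetime, timedelta


def staggered_schedule(clusters, start, interval=timedelta(hours=1)):
    """Assign each cluster a start time, spaced `interval` apart,
    in the order the clusters are listed."""
    return [(name, start + i * interval) for i, name in enumerate(clusters)]


# Hypothetical cluster names; lowest-risk clusters go first.
clusters = ["internal", "canary", "region-a", "region-b"]

# A Saturday morning, chosen purely for illustration.
start = datetime(2024, 1, 6, 9, 0)

plan = staggered_schedule(clusters, start)
for name, when in plan:
    print(f"{when:%a %H:%M}  remove old MA, deploy new MA on {name}")
```

Spacing the swaps out leaves time to confirm logs are flowing on one cluster before the next one loses its old MA.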

The cost of redemption was extremely high – it entirely disrupted the team, required working round-the-clock for days, and meant keeping a bazillion stakeholders informed.

Conclusion

Once bitten, twice shy. Sleeping dogs make me nervous; they are disasters-in-waiting, especially when no one truly understands them. In our scenario, even though the MA was unchanged, the underlying platform changed!

The common belief is that nothing can go wrong if nothing is touched. This is wrong; it is only a matter of time before a monster emerges from the deep dark depths.

There are too many variables in software development to feel secure in a barricading strategy – patches get deployed, platforms get updated, systems get restarted. The best guarantee is achieved by having reliable systems and engineering multipliers.

If you have critical systems in maintenance mode, you should prioritize stability investments; do not wait for bugs to crawl out.

FAQ

  1. How do I prioritize stability investments when I have high-priority projects?
    Estimate the risk of something going wrong in the barricaded service. What would be the fallout? Will it cripple the business? If there is an unacceptable level of risk, then surface this concern and get engineers working on this ASAP.
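
    One rough way to make that estimate concrete is a likelihood-times-impact score per service. The sketch below is an assumption-laden illustration rather than a formal risk model: the field names, the doubling penalty for services nobody understands, the threshold of 3.0, and the sample services are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class LegacyService:
    name: str
    failure_likelihood: float      # 0.0 to 1.0, a rough guess for the coming year
    business_impact: int           # 1 (minor) to 5 (cripples the business)
    people_who_understand_it: int  # engineers who could debug it today


def risk_score(svc):
    """Likelihood times impact, doubled when nobody understands the service."""
    score = svc.failure_likelihood * svc.business_impact
    if svc.people_who_understand_it == 0:
        score *= 2  # arbitrary penalty, assumed for illustration
    return score


# Hypothetical inventory of barricaded services.
services = [
    LegacyService("monitoring-app", 0.5, 5, 0),
    LegacyService("batch-report", 0.5, 2, 3),
]

UNACCEPTABLE = 3.0  # assumed threshold; pick one that fits your business

ranked = sorted(services, key=risk_score, reverse=True)
for svc in ranked:
    flag = "surface now" if risk_score(svc) >= UNACCEPTABLE else "monitor"
    print(f"{svc.name}: {risk_score(svc):.1f} ({flag})")
```

    The numbers are guesses, but writing them down gives you something to put in front of stakeholders when arguing for the investment.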
