Essential Pillars for running a service at scale


Software services need a solid foundation that guarantees near-100% uptime. The work needed to establish such a base is variously called DevOps, infrastructure or platform work.

About 18 months ago, my team got a new charter: launching a brand-new service. As part of that effort, I was involved in setting up the new platform resources.

A retrospective of the issues we faced over time exposes recurring themes. At a low level, the tasks appear orthogonal; however, the bird’s-eye view shows that most of them coalesce into five major pillars. I love patterns because they reduce cognitive load; once understood, the same models can be reused over and over.

I came up with the acronym SMART to describe these pillars: Stable, Measurable, Automated, Resilient and Tractable. Just as Joel’s famous test rates software teams, SMART serves as a similar gauge of infrastructure/platform health.

SMART provides:

  • A shared vocabulary for describing infrastructure pillars
  • A quick yardstick to rank investment areas
  • A quick scorecard to measure compliance across disparate resources and platforms.

1. Stable

Some examples of areas that fall under this pillar include:

1. Security

A stable system is secure: how easily can the system be patched? Updated? Upgraded? This also includes responding to security incidents and malicious actors. Can you rapidly roll certificates without any customer impact?
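As a small, concrete illustration of certificate hygiene, the sketch below measures how many days a host’s TLS certificate has left – the kind of signal that lets you roll certificates on your own schedule instead of under pressure. The host name is a placeholder assumption.

```python
import socket
import ssl
import time


def days_until_cert_expiry(host: str, port: int = 443) -> float:
    """Connect to the host over TLS and report how many days its certificate has left."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()            # parsed certificate of the peer
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400      # days remaining


print(days_until_cert_expiry("example.com"))    # placeholder host
```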

2. Isolation barriers

Isolation barriers help reduce the blast radius of problems. A good example is the ‘noisy neighbours’ problem when using shared resources: issues affecting one service should not impact co-located services on the same host.

3. Enforcing good behaviour

What happens when a user goes above and beyond the expected limits? Does the system have the capability to throttle users who send in a deluge of requests, or does it just crash?
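As an illustration, here is a minimal token-bucket throttle. The per-client rates and the in-process dictionary are assumptions for the sketch; a real system would enforce limits at the gateway and share state across hosts.

```python
import time


class TokenBucket:
    """Token-bucket throttle: each client gets `rate` requests per second with
    bursts of up to `capacity` requests."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last request, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # respond with HTTP 429 instead of falling over


# One bucket per client (placeholder rates; real limits belong at the gateway).
buckets: dict[str, TokenBucket] = {}


def is_allowed(client_id: str) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=5, capacity=10))
    return bucket.allow()
```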

2. Measurable

This is probably the most important pillar, as it is needed in tandem with every other pillar.

A good measurement system exposes information about the whole stack and the business: business, engineering and usage metrics should all be easily accessible. How would you know how you are doing if you don’t measure? If you are running blindly, you are running at a huge risk.

Most teams have disparate, non-consolidated measurement systems – a symptom is the existence of multiple team-specific dashboards. While this is better than nothing, it makes it difficult to come to a shared understanding or deduce strategic high-level insights. Consider investing in a single consistent measurement hub – one accessible and understandable by all.

Examples of things that should be measured, arranged as you go up the stack (from bare metal to bits), include the following; a small request-metrics sketch follows the list:

  • Resources (location, availability, uptime etc.)
  • For specific compute resources, metrics like memory, CPU, disk space
  • Request metrics (e.g. for HTTP, this would include request types, duration, location, success and failure rates).
  • Logging: The ideal logging solution is SMART too. Ideally, you should be able to triage, identify and resolve bugs from logs before customers report them.
  • Alerts for catastrophic errors
  • Usage metrics (e.g. User usage flows, acquisition funnels, bounce rates, stickiness, etc.)
  • Revenue metrics (e.g. Cost of Goods Sold (COGS): how much does it cost to acquire a single customer? Are you running at a profit or loss?)
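To make the request-metrics bullet concrete, here is a minimal in-process sketch. The registry, route names and status handling are assumptions; in practice these numbers would flow into the consolidated measurement hub described above.

```python
import time
from collections import defaultdict

# Toy in-process metrics registry; a real service would export these counters to
# the shared measurement hub rather than keep them in memory.
request_counts = defaultdict(int)       # (route, status) -> number of requests
request_durations = defaultdict(list)   # route -> list of latencies in seconds


def record_request(route: str, status: int, started_at: float) -> None:
    """Record one HTTP request: which route, how long it took, and whether it failed."""
    request_counts[(route, status)] += 1
    request_durations[route].append(time.monotonic() - started_at)


def success_rate(route: str) -> float:
    """Fraction of requests to a route that did not end in a server error."""
    total = sum(count for (r, _), count in request_counts.items() if r == route)
    ok = sum(count for (r, s), count in request_counts.items() if r == route and s < 500)
    return ok / total if total else 1.0
```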

3. Automated

Automate! Everything should be configured, tested and then automated. Running a production system without automation is just asking for trouble. And believe me, I have been burnt enough times and lost enough sleep.

A good example is automated resource creation as opposed to manual roll-outs. A battle-tested automation system reliably produces a stable end product. It is also faster and easier to scale than relying on engineers to run esoteric commands.

I am a huge fan of infrastructure as code (IAC). IAC solutions offer the following benefits (a small desired-state sketch follows the list):

  • IAC allows you to document the desired end state: one source of truth for anyone working on the team and a veritable reference point.
  • IAC is faster to execute than manual provisioning and validation.
  • IAC can offer guarantees (e.g. error checking, validation and compliance). Manual interventions are risky because of the possibility of impacting existing systems; IAC helps mitigate this.
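As a toy illustration of the ‘declare the end state, compute the difference’ idea, here is a Python sketch. The WebCluster type, its fields and the region name are made up for the example; real IAC tools do this with far more rigour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebCluster:
    """Hypothetical resource description: the single source of truth lives in code."""
    region: str
    instance_count: int
    instance_size: str


desired = WebCluster(region="westus", instance_count=4, instance_size="medium")


def plan(current: WebCluster, target: WebCluster) -> list[str]:
    """Compute the actions needed to move from the current state to the target,
    so changes can be validated and reviewed before anything is applied."""
    actions = []
    if current.region != target.region:
        actions.append(f"recreate cluster in {target.region}")
    if current.instance_size != target.instance_size:
        actions.append(f"resize instances to {target.instance_size}")
    diff = target.instance_count - current.instance_count
    if diff > 0:
        actions.append(f"add {diff} instances")
    elif diff < 0:
        actions.append(f"remove {-diff} instances")
    return actions


print(plan(WebCluster("westus", 2, "small"), desired))
```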

Another part of automation is continuous integration and delivery. Developers should not have to worry about manually getting their features out into production. A great delivery platform takes care of testing changes, slowly releasing them to production and notifying engineers of progress and issues. Even better, it can self-heal and roll back problematic changesets.
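A minimal sketch of that ‘release slowly, watch, roll back’ loop follows; the stage sizes, error budget and the stubbed metrics query are all assumptions standing in for a real delivery platform.

```python
import random
import time

STAGES = [0.01, 0.05, 0.25, 1.0]   # fraction of traffic on the new build
ERROR_BUDGET = 0.02                # abort if more than 2% of requests fail


def observed_error_rate() -> float:
    """Stub for a real metrics query scoped to the new build."""
    return random.uniform(0.0, 0.03)


def deploy(build_id: str) -> bool:
    for fraction in STAGES:
        print(f"{build_id}: shifting {fraction:.0%} of traffic to the new build")
        time.sleep(1)                       # stand-in for a real soak period
        if observed_error_rate() > ERROR_BUDGET:
            print(f"{build_id}: error budget exceeded, rolling back")
            return False                    # traffic returns to the last good build
    print(f"{build_id}: fully released")
    return True


deploy("build-1234")
```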

Automated systems allow engineers to focus on delivering the critical business features. This implicit trust that automation will handle mundane issues helps accelerate value delivery.

It might not be possible to automate everything. In such scenarios, strictly ensuring there is a single entry point helps avoid diverging automation paths. Automation is great because it helps ensure consistency; if that consistency guarantee is broken, automation can even become harmful.

4. Resilient

A system should be resilient to failures, increased load and the 1001 other things that could happen. A popular advantage of cloud hosting is elastic scaling, which comes at a cost, but that’s not the only facet of resiliency.

Things to consider in this bucket include adding fail-over resources that provide a backup for critical systems: if you lose a critical resource, you can always switch over to the backup. Another is having tested backup systems; an untested backup is just as bad as having no plan.
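As one small example of ‘tested backups’, the sketch below reads a backup back (decompressing as it goes) and checks it against a recorded checksum. The paths and checksum manifest are assumptions; a fuller drill would restore into a scratch environment and run queries against it.

```python
import gzip
import hashlib
from pathlib import Path


def verify_backup(backup: Path, expected_sha256: str) -> bool:
    """Read the whole backup back and compare its checksum against the value
    recorded when the backup was taken."""
    digest = hashlib.sha256()
    with gzip.open(backup, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


# Run this on a schedule; a failure here should page someone (Measurable!) long
# before the backup is actually needed.
```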

Just like in security, a resilient system is only as strong as its weakest link. If you have top-of-the-line resiliency plans that can only be executed by one person, then your bus factor is effectively one; all it takes is for that engineer to leave your company to expose a gaping hole. I like the model of enforcing shared ownership and understanding of critical systems. To fix this, document critical recovery practices and run drills to build shared team ownership.

Investing in self-healing systems also pays off in the long run. One advantage of using such systems (e.g. Azure Service Fabric, Docker Swarm or Kubernetes) is their smart self-balancing act. Such systems can always spin up new instances / services to match the desired state specified by the service author.
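A toy version of that desired-state reconciliation follows, with an in-memory list standing in for the cluster; the orchestrators above do this with real scheduling, health checks and event streams.

```python
import time

DESIRED_REPLICAS = 3
cluster: list[str] = ["web-1"]     # pretend two instances just died


def start_instance(name: str) -> None:
    print(f"starting {name}")
    cluster.append(name)


def reconcile() -> None:
    """Observe the current state, compare it with the desired state, and act."""
    missing = DESIRED_REPLICAS - len(cluster)
    for _ in range(max(missing, 0)):
        start_instance(f"web-{len(cluster) + 1}")   # heal back to the desired state


for _ in range(3):                 # real orchestrators run this continuously
    reconcile()
    time.sleep(1)
```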

5. Tractable

Systems should be easy to control, deal with and modify.

Emergencies occur every now and then. When they happen, you don’t want to be stuck with hard-to-change systems; you want the ability to move fast and fix things.

The first 4 pillars help to isolate and mitigate the impact of issues but won’t protect you from software bugs with widespread impact. The fix for this class of problems is having a tractable system.

Things that fall into this bucket include:

1. Flighting/feature flags – I love flights because they allow you to safely roll out new features. If something breaks during the roll-out, you turn off the flight and return users to a good state. Problem solved. (See the sketch after this list.)

2. Standard Operating Procedures (SOPs) – Well-written, actionable SOPs help in resolving problems. Anyone can use them to handle live incidents without pulling in more team members. This is why documentation (up-to-date, easily accessible and consistent) is a big win: it makes your platform more serviceable.

3. HotFixes – Do you have a way to quickly get a code fix onto a specific resource? Automated systems can be inflexible since they are designed for generic scenarios. Having no quick shortcuts is a limiting factor that can hold up your value delivery, so consider having safe, well-defined bypass systems.

4. System support – sometimes we all have to wake up and get on that call for that severity-0 issue. Having all of the above in place should at least help you get back to a good state as fast as possible.
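Here is the feature-flag sketch referenced in item 1; the flag name, rollout percentage and bucketing scheme are assumptions.

```python
import hashlib

FLAGS = {"new-checkout-flow": 10}   # percentage of users who see the feature


def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user so their experience stays consistent while
    the flag ramps up; setting the rollout to 0 turns the feature off for everyone."""
    rollout = FLAGS.get(flag, 0)
    bucket = int(hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < rollout


# In request-handling code:
if is_enabled("new-checkout-flow", user_id="user-42"):
    print("new code path")
else:
    print("existing, known-good path")
```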

How a SMART system protects you from common problems

  • A resource region is down/corrupt
    • Measurable – Alerts fire to notify owners
    • Resilient – switch to a backup
    • Automated
      • Spin up a replacement OR
      • Fix the original server
    • Resilient – route traffic back to the replacement
  • A buggy feature has rolled out
    • Measurable – Error logs start coming in, alerts fire
    • Tractable
      • Turn off the feature flag
      • Roll back that feature
  • Someone is trying to hack into the system or mount a DOS
    • Measurable – Alerts fire
    • Stable
      • Load can be safely handled
      • User is throttled
  • The Slashdot effect – your feature is top of the news, unexpected usage levels
    • Stable – System responds and copes with load
    • Measurable – Alerts fire when thresholds are crossed
    • Resilient – Systems start scaling out to handle an increased load
  • Someone inadvertently deletes data
    • Resilient – Backups exist
    • Tractable – Standard procedures on restoring data exist
  • A service goes rogue and consumes all memory or CPU
    • Stable – Noisy neighbours don’t have any impact on co-located services
    • Measurable – Alerts fire so the rogue service can be fixed
  • Need to spin up new resources in a new region
    • Automated – rollout (order of minutes)
    • Tractable – deploy all systems on the new resource
    • Measurable – ensure the new resource is in a valid state
    • Resilient – Activate and plugin new region into existing infrastructure

As is obvious, the Measurable pillar is the most important – it is how you know whether you are up and running at all. Don’t run your service blindly!

Think infrastructure, think SMART.

9 thoughts on “Essential Pillars for running a service at scale”

  1. I have never worked on systems of the complexity you describe. It sounds like science fiction to me. One thing I’m sure you have, but don’t explicitly mention, is visualization systems. For systems as complex as you describe, there have to be tools to make sense of very large datasets quickly. Also, having an untested backup plan is considerably worse than having no plan because it gives you a false sense of security. Great article.


    1. Thanks Joseph.

      Great point – I implicitly assumed visualizations would fall under measurable (dashboards etc.). They definitely offer a direct pathway to analyzing problems.

      And yes, there are loads of tools to make sense of it all. The complexity can be mind-boggling.

