Lessons learned from running services at scale: 1

My day job nowadays revolves around keeping our services up and writing a lot of backend code. Two years ago, though, it was radically different: I was a frontend engineer. A great way to acquire skills fast is to be thrown in at the deep end (sink or swim). I am not sure it is the best way for newbies to learn, but that is a story for another day.

This post describes some of the lessons I have learnt in the past two years.

1. Can’t go wrong with logging

The goal of logging is to make it easy to validate behaviour and debug errors when things go wrong (and they inevitably will). You need to log informational (e.g. server locations, server types), warning (e.g. unexpected scenarios) and error (e.g. null references, exceptions) messages.

Watch out though: don’t log personally identifiable information (PII)!
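One way to enforce this is to scrub messages before they are written. The snippet below is a minimal sketch using Python's standard `logging` module. It assumes e-mail addresses are the only PII of concern, and the logger name `provisioning` and the pattern are illustrative; a real filter would need to cover far more.

```python
import logging
import re

# Assumption: only e-mail addresses are treated as PII in this sketch.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Replace anything that looks like an e-mail address with a placeholder."""
    return EMAIL_PATTERN.sub("[REDACTED]", text)


class RedactPiiFilter(logging.Filter):
    """Scrub PII from a record before any handler sees it."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message first so PII passed as arguments is caught too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # keep the record, just with the redacted message


logger = logging.getLogger("provisioning")  # hypothetical logger name
logger.addFilter(RedactPiiFilter())
```

A filter attached to the logger runs for every record logged through it, so individual call sites do not have to remember to redact.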

Logging good information is an art – it takes skill and experience to craft log messages that can be used to build dashboards, confirm expected usage or debug error conditions. The goal is being able to answer questions without having to load code or open an IDE. With adequate logging, it is possible to quickly isolate and fix buggy methods or unexpected usage patterns.
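Logs are easiest to turn into dashboards when each entry is structured rather than free text. Here is a rough sketch of that idea, again with the standard `logging` module. The field names (`region`, `server_type`), the logger name and the `context` key are assumptions made for illustration, not a prescribed schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object so it can be queried later."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields are passed by the caller via extra={"context": {...}}.
        entry.update(getattr(record, "context", {}))
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("db_provisioner")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Informational, warning and error messages, each carrying queryable context.
logger.info("provisioning started",
            extra={"context": {"region": "westeurope", "server_type": "replica"}})
logger.warning("unexpected retry", extra={"context": {"attempt": 2}})
try:
    raise TimeoutError("replica did not respond")
except TimeoutError:
    logger.error("provisioning failed", exc_info=True,
                 extra={"context": {"region": "westeurope"}})
```

With entries shaped like this, a question such as "how many failures per region?" becomes a query over the logs instead of a trip into the code.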

You can hardly have too much logging, so when in doubt, add more logs. They should be cheap anyway and add little or no overhead (e.g. to performance).

2. Infrastructure as Code

As your service grows, you’ll need to scale up your resources. If you rely on manual steps, the effort required to spin up new compute resources grows linearly with each expansion. Apart from the high cost, this approach is also tedious and error-prone; the slightest mistake can lead to a business outage.

So how do you keep things sane and get pain-free deployments? You could invest in scripts and fail-safe checks, or you could look at Infrastructure as Code (IAC). Having used both, I strongly prefer IAC approaches.

IAC allows you to document the desired end state and can be less verbose than scripts. Moreover, IAC tools usually support checks like config validation and incremental rollouts. These checks reduce the risk of making mistakes or wiping existing resources while deploying new changes.
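To make the "desired end state" idea concrete, here is a toy sketch of how a declarative tool might compare what exists with what is wanted and refuse to delete anything unless explicitly told to. The function name `plan`, the resource dictionaries and the `allow_delete` flag are all hypothetical; real IAC tools do this in their own ways.

```python
def plan(current: dict, desired: dict, allow_delete: bool = False) -> dict:
    """Work out what must change to move from `current` to `desired`.

    Both arguments map a resource name to its settings.
    """
    to_create = {name: spec for name, spec in desired.items() if name not in current}
    to_update = {name: spec for name, spec in desired.items()
                 if name in current and current[name] != spec}
    to_delete = sorted(set(current) - set(desired))

    # Safety check: never wipe existing resources by accident.
    if to_delete and not allow_delete:
        raise ValueError(f"refusing to delete existing resources: {to_delete}")

    return {"create": to_create, "update": to_update, "delete": to_delete}


current = {"db-1": {"sku": "small"}, "cache-1": {"sku": "small"}}
desired = {"db-1": {"sku": "large"}, "cache-1": {"sku": "small"}, "db-2": {"sku": "small"}}
changes = plan(current, desired)
# Only db-1 is updated and db-2 is created; cache-1 is left untouched.
```

Because the plan is computed from the two states, running it twice against an unchanged environment yields no work, which is what makes incremental rollouts safe to repeat.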

It becomes very easy to spin up new instances. Typically, you run the same template with different parameters. With ARM templates, for example, all you have to do is create a new parameter file and run it against the template. Voilà! You can go do something else while waiting for your new resources to come online. This ability to cheaply roll out new resources is a huge boon to any business.
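The "same template, different parameters" pattern can be sketched in a few lines. Note that this is not ARM template syntax: the template fields, parameter names and region values below are made up purely to illustrate the idea.

```python
from string import Template

# Hypothetical template: one shared description of the desired end state.
TEMPLATE = {
    "name": "web-$region",
    "region": "$region",
    "sku": "$sku",
    "instances": "$instance_count",
}
REQUIRED_PARAMETERS = ("region", "sku", "instance_count")


def validate(parameters: dict) -> None:
    """Fail fast on bad input instead of half-way through a deployment."""
    missing = [name for name in REQUIRED_PARAMETERS if name not in parameters]
    if missing:
        raise ValueError(f"missing parameters: {missing}")
    if int(parameters["instance_count"]) < 1:
        raise ValueError("instance_count must be at least 1")


def render(parameters: dict) -> dict:
    """Fill the shared template with one environment's parameters."""
    validate(parameters)
    values = {key: str(value) for key, value in parameters.items()}
    return {key: Template(value).substitute(values) for key, value in TEMPLATE.items()}


# A new region is just a new parameter set run against the same template.
west = render({"region": "westeurope", "sku": "standard", "instance_count": 3})
east = render({"region": "eastasia", "sku": "standard", "instance_count": 5})
```

The template is written and reviewed once; each new deployment only adds a small parameter set.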

The problems with IAC, though, are the steep learning curve, the unavailability of certain actions via IAC approaches and the limited expressiveness of declarative approaches. However, in my opinion, the peace of mind and ease of use outweigh these drawbacks.

3. Self-healing Systems and Code

Can your systems take care of themselves? Say there is a power outage or someone cuts the optic fibre: are your systems smart enough to detect the outage, route traffic to other nodes and then attempt to bring up the dead nodes? One big advantage of using microservice orchestration platforms (e.g. Service Fabric, Docker Swarm, Kubernetes) is the automatic maintenance of the desired state.

Say you always want 7 instances of service X running; such systems can detect outages, spin up healthy replacement services and route traffic to them. All of this happens without requiring human input.
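Conceptually, this boils down to a reconciliation step that runs over and over. The sketch below is a heavily simplified illustration, not how any of the platforms above implement it; `reconcile`, `start_instance` and the instance names are hypothetical.

```python
import itertools


def reconcile(instances: dict, desired_count: int, start_instance) -> list:
    """Return the instances traffic should go to, replacing unhealthy ones.

    `instances` maps an instance name to the result of its last health check.
    """
    # Stop routing to anything that failed its health check.
    healthy = [name for name, is_healthy in instances.items() if is_healthy]
    # Start replacements until the desired count is reached again.
    while len(healthy) < desired_count:
        healthy.append(start_instance())
    return healthy


_ids = itertools.count(8)


def start_instance() -> str:
    # Stand-in for actually launching a new replica.
    return f"service-x-{next(_ids)}"


observed = {f"service-x-{i}": True for i in range(1, 8)}
observed["service-x-3"] = False  # simulate an outage on one instance
routable = reconcile(observed, 7, start_instance)
```

An orchestrator effectively runs a step like this continuously, which is why a dead node gets replaced without anyone being paged.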

The alternative would be implementing one’s own gateway and load-routing systems. Wouldn’t that be a lot of code and systems to monitor?

I had to wipe some corrupt metadata created by a critical code path recently. It was an easy decision for me because I had written that code to be self-healing; consequently, it would recreate the proper metadata if it was missing. Had it not been self-healing, that problem would have been very painful to solve.
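A rough sketch of what self-healing code of this kind can look like is shown below. It assumes the metadata lives in a JSON file and can be rebuilt from scratch; `build_metadata`, the `schema_version` field and the file layout are hypothetical stand-ins rather than the actual system.

```python
import json
from pathlib import Path


def build_metadata() -> dict:
    # Hypothetical: in a real system this would be derived from a source of truth.
    return {"schema_version": 1, "databases": []}


def load_metadata(path: Path) -> dict:
    """Load metadata, recreating it if it is missing or unreadable."""
    try:
        metadata = json.loads(path.read_text())
        if not isinstance(metadata, dict) or "schema_version" not in metadata:
            raise ValueError("metadata is malformed")
        return metadata
    except (FileNotFoundError, ValueError):
        # json.JSONDecodeError is a ValueError, so corrupt files land here too.
        metadata = build_metadata()
        path.write_text(json.dumps(metadata))
        return metadata
```

With this shape, deleting the bad file is a safe fix: the next call simply recreates a valid one.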

4. Automated testing helps a lot

Apart from refactoring code and improving code quality, the bulk of the code I have written in the past 12 months has been on a distributed database provisioning and upgrade system.

Provisioning a database is a long-running operation and requires polling multiple endpoints. Manual testing and verification during development was a slow, demanding process. I was able to use Postman to mitigate most of this, but testing in production posed another challenge.
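Polling a long-running operation usually follows the same pattern: check, wait, check again, and give up after a deadline. Here is a minimal sketch; the status strings (`Succeeded`, `Failed`) and the default timings are assumptions, and `get_status` stands in for whatever endpoint call reports progress.

```python
import time


def wait_until_complete(get_status, timeout_seconds=600.0, poll_interval_seconds=5.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll `get_status` until the operation finishes or the deadline passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "Succeeded":
            return status
        if status == "Failed":
            raise RuntimeError("operation failed")
        if clock() >= deadline:
            raise TimeoutError(f"operation still '{status}' after {timeout_seconds}s")
        sleep(poll_interval_seconds)


# Simulated operation that needs three polls to finish.
statuses = iter(["InProgress", "InProgress", "Succeeded"])
result = wait_until_complete(lambda: next(statuses), sleep=lambda _: None)
```

Passing `sleep` and `clock` in as parameters keeps the function testable without real waiting.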

I made the call to run all tests via automated runners in all environments, and that is one decision I am very glad I made: it eventually paid off many times over.
Getting the runners set up was expensive (it took about a two-week sprint): the test runner framework had to be revamped, the tests had to be written to run in parallel and pass within expected time limits, and the code had to be deployed to multiple regions. I also had to address some test setup and cleanup concerns.
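For illustration, a parallel runner with time limits can be sketched as follows. This is not the framework described above; `run_scenario`, `run_all` and the result fields are hypothetical, and the sketch only flags slow scenarios after they finish rather than aborting them.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_scenario(name, scenario, time_limit_seconds):
    """Run one scenario and record whether it passed within the time limit."""
    started = time.monotonic()
    try:
        scenario()
        passed, error = True, None
    except Exception as exc:
        passed, error = False, repr(exc)
    elapsed = time.monotonic() - started
    # Too slow counts as a failure, even if the scenario itself succeeded.
    if passed and elapsed > time_limit_seconds:
        passed, error = False, "exceeded time limit"
    return {"name": name, "passed": passed, "seconds": elapsed, "error": error}


def run_all(scenarios, time_limit_seconds=60.0, max_workers=4):
    """Run every scenario in parallel and collect the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_scenario, name, scenario, time_limit_seconds)
                   for name, scenario in scenarios.items()]
        return [future.result() for future in futures]
```

Each scenario is just a callable, so a scheduler can run the same suite against every environment and region.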

Once set up though, the benefits were immense:

  1. The automated tests provided an easy way to gauge the ‘health’ of the platform since they ran continuously every day. A request for metrics (reliability, times, stability etc.) just required pulling up the logs.
  2. As new scenarios were added, it was very easy to add in test coverage. For example, a feature that shipped required about 5 lines of code for automated testing.
  3. The biggest win was the exposure of latent bugs in the underlying systems we depended upon. The runner tests would trigger code paths that manual tests wouldn’t have covered (e.g. rapid calls, cross-regional API calls and load testing). Consequently, we discovered new failure modes. My favourite was the revelation of data inconsistency during data replication across regions.
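As an illustration of the first point, turning recorded test runs into health metrics can be very little code. The sketch below assumes each run is stored as a record with `passed` and `seconds` fields; that record shape is hypothetical.

```python
from statistics import mean


def summarise(results):
    """Reduce a list of test-run records to a few headline health metrics."""
    total = len(results)
    passed = sum(1 for result in results if result["passed"])
    return {
        "runs": total,
        "pass_rate": passed / total if total else 0.0,
        "mean_seconds": mean(result["seconds"] for result in results) if total else 0.0,
    }


runs = [
    {"passed": True, "seconds": 2.0},
    {"passed": False, "seconds": 4.0},
]
summary = summarise(runs)
```

Because the runners produce these records every day, the same summary can be computed per region or per week just by filtering the input.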

Looking back, I shudder to think of how we’d have coped otherwise.


There are a few more things too but those should be in Part 2.



