The complicated parts of leadership: Trust and Verify


Introduction

“Our efforts for the past two months have had no noticeable effects on our metrics; what will we tell the VPs next week?”

My PM counterpart

Background

I broke out in a cold sweat; how would I tell 2 VPs and 4 directors that the project was off track after 3 months? I dreaded entering the room and had already conjured the 1001 ways I would get chewed up and then summarily spat out… like bubble gum.

I felt helpless: despite all the hard work, the metrics weren’t improving; they were nosediving! And I was expected to talk about this budding failure to this group?! Each of the 4 executive directors led organizations of about 200 – 400 people, while the VPs led organizations well into the thousands. These were folks with multiple decades of experience. How did I get here?

Three months earlier

I was leading the most critical project for the organization’s long-term survival. The project had started on a rocky note with some bickering – one exec blatantly accused a peer of being the source of our organizational woes. To the accused’s credit, he took it magnanimously and offered an olive branch – he would make all resources available to solve this pain point once and for all.

My team, responsible for my organization’s chunk of the project, needed to collaborate with multiple platform teams for success. We had to rethink the architecture, engineering systems, and release mechanisms to meet the strict performance and reliability targets. Meeting these goals would enable us to use the new platform features.

Most of the engineers on the team (including me) were new – we had been reorged for the umpteenth time. This lack of familiarity meant we had to undergo onboarding – learning about the existing software architecture and assumptions from the original authors.

After familiarizing ourselves with the existing systems, we proposed changes to meet the perf and reliability constraints. We did a system design review with the original team, who assured us of our plans’ validity, and we believed them.

We collaborated with the platform teams on the engineering fixes and soon rolled out the first version. The results were subpar; however, no one raised eyebrows since it was the first month. Then, the second month rolled by, and the results were no different. In fact, they were worse! Something was wrong somewhere – localized runs confirmed that the engineering fixes improved both perf and reliability numbers; however, analysis of aggregated production runs showed otherwise. 

We could not present these results at the next leadership sync without having a reason.


Solving the 1 + 1 === 3 mystery

I returned to first principles to understand the root cause. I started with a few outliers and painstakingly went through each log line. Soon, a pattern emerged: all the instances with extremely long installation times had two major installation events. That was odd, since we should not have been reinstalling the same package versions.
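
For the curious, the analysis boiled down to counting installation events per instance. Here is a minimal sketch in Python, assuming hypothetical log records of (instance ID, event type, timestamp) – the field names and values are illustrative, not the actual schema:

from collections import defaultdict

# Hypothetical log records: (instance_id, event_type, timestamp_seconds).
# The real logs and fields differed; this only illustrates the approach.
records = [
    ("vm-001", "install", 100), ("vm-001", "delete", 5000), ("vm-001", "install", 5400),
    ("vm-002", "install", 120),
]

install_counts = defaultdict(int)
for instance_id, event_type, _ in records:
    if event_type == "install":
        install_counts[instance_id] += 1

# Flag the odd cases: the same package versions should only install once.
suspects = [i for i, n in install_counts.items() if n > 1]
print(suspects)  # ['vm-001'] – a second install right after a deletion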

After digging deeper, I realized the system was running reinstallations as part of deletions! This unexpected and unexplainable behaviour was the source of our woes. I found it hard to believe the facts even though the figures proved this was happening. So I set out to confirm with the engineers who initially set up the system.

It took me nearly two hours to prepare the details clearly and succinctly: I built a small web app visualizing the timelines of the problematic instances, crafted an email with clear calls to action pointing out that the data was wrong, and proofread it multiple times. The effort paid off, as it proved incontrovertibly that deletions involved installations.

Shortly afterwards, the original authors responded: “Oh yeah, sorry about that; we forgot to mention it initially because it has always worked that way.”

Remediation

Decontaminating the data

The first step was to validate the extent of the damage by analyzing a subset of the dataset. The analysis of the narrowed dataset revealed improved performance and reliability – we could exhale: the engineering changes worked. 

The next step was to update the entire pipeline to exclude those invalid instances. Fortunately, filtering the resurrected zombie instances out of the pipeline was straightforward, and the project was back on track.
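
A rough sketch of that exclusion step, assuming the aggregated production results sit in a pandas DataFrame and that the zombie instances were flagged by the earlier log analysis (column names are hypothetical):

import pandas as pd

# Hypothetical aggregated production results; column names are illustrative.
results = pd.DataFrame({
    "instance_id": ["vm-001", "vm-002", "vm-003"],
    "install_time_s": [5400, 300, 290],
})

# Instances flagged as "resurrected zombies" (deletion-triggered reinstalls).
zombie_instances = {"vm-001"}

# Drop the contaminated instances before recomputing perf and reliability metrics.
clean = results[~results["instance_id"].isin(zombie_instances)]
print(clean["install_time_s"].describe())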

Owning the story

We had to inform the leadership about the near miss; that was the honourable (albeit uncomfortable) thing to do. We came clean about the challenges we faced, what we learnt, the impact on the project, and the remediation steps.

The top four questions the exec briefing covered were:

  • What could have been better: We trusted without verification
  • What we learnt: Deletions triggering installs threw a wrench in all our analysis and assumptions.
  • What it meant: Even though we had nearly lost 2 months of progress, the project was still on track.
  • What we’d do to prevent recurrences: Spot checks, improved verification

Outcomes

We were able to go all out and complete the project ahead of time. I still shudder when I think of other, less favourable outcomes: what if the data inaccuracies had masked serious issues and set us back multiple months? Such an outcome would have severely damaged trust.

Lessons

  1. Trust and Verify: I didn’t double-check because I trusted the sign-offs blindly; that embarrassing slip nearly cost us two months of hard work. A few tips: have leading indicators to validate your assumptions, or find outsiders to poke holes in your ‘fail-proof’ systems; verification is more work but can save you many blushes (see the sketch after this list).
  2. Own the message: When a project runs into a crisis, what matters most is taking ownership and coming up with solutions. In most cases, how you got into the situation pales compared to how you handled the problem. Informing the execs about our travails was scary, but it was the right thing to do and earned us even more trust and respect. Sweeping bad news under the carpet is like hiding spoilt milk – it doesn’t get better with age.
  3. Seek fast feedback loops: The longer it takes to get results, the more likely you are to miss failures. Fast feedback loops help you validate risky assumptions; for example, we would have spotted the discrepancies earlier if we had daily deploys.
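
As a concrete example of the spot checks from lesson 1, a small invariant check run against each day’s fresh data would have surfaced the anomaly within a day instead of two months. A sketch, again with hypothetical record shapes and thresholds:

def spot_check(records, max_installs_per_instance=1):
    """Fail fast if any instance reports more install events than expected."""
    counts = {}
    for instance_id, event_type, _ in records:
        if event_type == "install":
            counts[instance_id] = counts.get(instance_id, 0) + 1
    offenders = {i: n for i, n in counts.items() if n > max_installs_per_instance}
    if offenders:
        raise ValueError(f"Unexpected repeat installs: {offenders}")

# Run daily against fresh production data so broken assumptions surface quickly.
spot_check([("vm-002", "install", 120)])  # passes silently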

