You're reading the free online version of this book. If you'd like to support me, please consider purchasing the e-book.

Of Outages and Incidents

I used to tell my people: "You're not really a member of the team until you crash production".

In the early days of the company, crashing production—or, at least, significantly degrading it—was nothing short of inevitable. And so, instead of condemning every outage, I treated each one like a rite of passage. This was my attempt to create a safe space in which my people could learn about and become accustomed to the product.

What I knew, even then, from my own experience was that engineers need to ship code. This is a matter of self-actualization. Pushing code to production benefits us just as much as it benefits our customers.

But, when our deployments became fraught, my engineers became fearful. They began to overthink their code and under-deliver on their commitments. This wasn't good for the product. And, it wasn't good for the team. Not to mention the fact that it created an unhealthy tension between our Executive Leadership Team and everyone else.

The more people we added to our engineering staff, the worse this problem became. Poorly architected code with no discernible boundaries soon led to tightly coupled code that no one wanted to touch, let alone maintain. Code changed in one part of the application often had a surprising impact in another part of the application.

The whole situation felt chaotic and unstoppable. The best we thought we could do at the time was prepare people for the trauma. We implemented a certification program for Incident Commanders (IC). The IC was responsible for getting the right people in the (Slack) room; and then, liaising between the triage team and the rest of the organization.

To become IC certified, you had to be trained by an existing IC at the company. You had to run through a mock incident and demonstrate that you understood:

  • How to identify the relevant teams.
  • How to page the relevant teams, waking them up if necessary.
  • How to start the "war room" Zoom call.
  • How to effectively communicate the status of the outage and the estimated time to remediation.

This IC training and certification program was mandatory for all engineers. Our issues were very real and very urgent; and, we needed everyone to be prepared.

Over time, we became quite adept at responding to each incident. And, in those early days, this coalescing around the chaos formed a camaraderie that bound us together. Even years later, I still look back on those Zoom calls with a fondness for how close I felt to those that were there fighting alongside me.

But, the kindness and compassion that we gave each other internally meant nothing to our customers. The incidental joy that we felt from a shared struggle was no comfort to those that were paying us to provide a stable product.

Our Chief Technical Officer (CTO) at the time understood this. He never measured downtime in minutes; he measured it in lost opportunities. He painted a picture of customers, victimized by our negligence:

"People don't care about Service Level Objectives and Service Level Agreements. 30 minutes of downtime isn't much on balance. But, when it takes place during your pitch meeting and prevents you from securing a life-changing round of venture capital, 30 minutes means the difference between a path forward and a dead-end."

The CTO put a Root Cause Analysis (RCA) process in place and personally reviewed every write-up. For him, remediating the incident was only a minor victory—preventing the next incident was his real goal. Each RCA included a technical discussion about what happened, how we became aware of the problem, how we identified the root cause, and the steps we intended to take in order to prevent it from occurring again.

The RCA process—and subsequent Quality Assurance Items—did create a more stable platform. But, the platform is merely a foundation upon which the product lives. Most of our work took place above the platform, in the ever-evolving user-facing feature-set. To be certain, a stable platform is necessary. But, as the platform stabilized, the RCA process began to see a diminishing return on investment.

In a last-ditch effort to drive better outcomes, a Change Control Board (CCB) was put in place. A CCB is a group of "subject matter experts" that must review and approve all changes being made to the product, with the hope that naive, outage-causing decisions will be caught before they cause a problem.

Unfortunately, a Change Control Board is the antithesis of worker autonomy. It is the antithesis of productivity. A Change Control Board says, "we don't pay you to think." A Change Control Board says, "we don't trust you to use your best judgement." If workers yearn to find fulfillment in self-direction, increased responsibility, and a sense of ownership, the Change Control Board seeks to strip responsibility and treat workers as little more than I/O devices.

And yet, with the yoke of the CCB in place, the incidents continued.

After working on the same product for over a decade, I have the luxury of hindsight and experience. I can see what we did right and what we did wrong. The time and effort we put into improving the underlying platform was time well spent. Without a solid foundation on which to build, nothing much else matters.

Our biggest mistake was trying to create a predictable outcome for the product. We slowed down the design process in hopes that every single pixel was in the correct location. We slowed down the deployment pipeline in hopes that every single edge case and feature had been fully implemented and tweaked to perfection.

We thought that we could increase quality and productivity by slowing everything down and considering the bigger picture. But, the opposite is true. Quality increases when you go faster. Productivity increases when you work smaller. Innovation and discovery take place at the edges of understanding, in the unpredictable space where customers and product meet.

Eventually, we learned these lessons. The outages and incidents went from a daily occurrence to a weekly occurrence to a rarity. At the same time, our productivity went up, team morale was boosted, and paying customers started to see the value we had promised them.

None of this would have been possible without feature flags.

Feature flags changed everything.
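
If you've never worked with them, here is a minimal sketch of the idea (the names and the in-memory store below are hypothetical, for illustration only, and not the actual implementation used in our product): a feature flag is just a runtime check that decides whether a code path is live, so new code can ship to production "dark" and be turned on, or off, without a deployment.

```typescript
// Hypothetical flag names, for illustration.
type FlagName = "new-checkout-flow" | "beta-dashboard";

interface FlagStore {
	// Returns whether the given flag is enabled for the given user.
	isEnabled(flag: FlagName, userId: string): boolean;
}

// An in-memory store keeps the sketch self-contained; a real system would
// read targeting rules from an admin-controlled service at runtime.
class InMemoryFlagStore implements FlagStore {
	constructor(private enabledFor: Record<FlagName, Set<string>>) {}

	public isEnabled(flag: FlagName, userId: string): boolean {
		return this.enabledFor[flag].has(userId);
	}
}

const flags: FlagStore = new InMemoryFlagStore({
	"new-checkout-flow": new Set(["user-42"]), // rolled out to one user
	"beta-dashboard": new Set(),               // shipped, but still dark
});

function renderCheckout(userId: string): string {
	if (flags.isEnabled("new-checkout-flow", userId)) {
		return "new checkout"; // the risky new code path
	}

	return "old checkout"; // the known-good fallback
}

console.log(renderCheckout("user-42")); // "new checkout"
console.log(renderCheckout("user-99")); // "old checkout"
```

The point of the sketch is the shape of the change: the new code path lives in production behind the check, the old path remains the default, and rolling forward (or back) is a configuration change rather than a deployment. The rest of this book explores what that shift makes possible.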

Have questions? Let's discuss this chapter: https://bennadel.com/go/4540