Measuring Team Productivity
Fuzzy math over time still shows trends.
— Mike Maughan (podcast, No Stupid Questions)
When your team uses feature flags to work small and build incrementally, what you end up with is a culture of shipping. In this culture, deployments become the heartbeat of the system; and, a bellwether of health. When a team is functioning well, its engineers ship often. And, when a team is having trouble, shipping slows down or becomes a rarity.
Deployments, therefore, act as a metric that can proxy team health and productivity. There's no absolute value here that indicates health; but, deviations from the baseline will reveal truths about the team, its engineers, and their relationship to the rest of the organization.
For example, if an engineer hasn't shipped code in a day or two, it might be an indication that they are stuck. Perhaps the project requirements are unclear; and, they're afraid to ask questions. Or, perhaps the work is too complex; and, they're having trouble decomposing the work into small, incremental steps.
And, if the team as a whole isn't shipping with good frequency, it might be an indication that inter-team communication is broken. Perhaps the design team is nit-picking too many details. Or, perhaps the QA team is under-staffed.
In any case, when you create a culture of shipping, any sustained drop-off in deployment frequency means that something is probably wrong; and, it merits further investigation by a manager or a mentoring engineer.
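To make this concrete, here's a minimal sketch of what baseline-deviation tracking might look like. It assumes that you can export deployment events (who deployed, and when) from your CI system; and, the 7-day window, 28-day baseline, and 50% threshold are all illustrative assumptions, not prescriptions.

```typescript
// Caveat: this is a sketch, not a prescription. It assumes that deployment
// events can be exported from your CI system as (engineer, timestamp)
// pairs; and, every window and threshold here is an arbitrary starting point.

interface Deployment {
	engineer: string;
	deployedAt: Date;
}

// Count the deployments per engineer within the trailing window of days.
function countDeployments(
	deployments: Deployment[],
	windowInDays: number
): Map<string, number> {
	const cutoff = Date.now() - (windowInDays * 24 * 60 * 60 * 1000);
	const counts = new Map<string, number>();

	for (const deployment of deployments) {
		if (deployment.deployedAt.getTime() >= cutoff) {
			counts.set(
				deployment.engineer,
				(counts.get(deployment.engineer) ?? 0) + 1
			);
		}
	}

	return counts;
}

// Flag engineers whose recent cadence has dropped well below their OWN
// trailing baseline. There's no absolute value that indicates health;
// we only care about deviations from each engineer's normal rhythm.
function findSlowdowns(deployments: Deployment[]): string[] {
	const thisWeek = countDeployments(deployments, 7);
	const lastMonth = countDeployments(deployments, 28);
	const flagged: string[] = [];

	for (const [engineer, monthlyTotal] of lastMonth) {
		const weeklyBaseline = monthlyTotal / 4;
		const recentCount = thisWeek.get(engineer) ?? 0;

		if (recentCount < (weeklyBaseline * 0.5)) {
			flagged.push(engineer);
		}
	}

	return flagged;
}
```

Again, the output of a script like this is a conversation starter, not a scorecard; the goal is to find people who may be stuck and need help.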
Goodhart's Law states that, "when a measure becomes a target, it ceases to be a good measure". Which implies that people will game a system whenever pressure is applied. As such, it's natural for an organization to scoff at the notion that number of deployments can act as a proxy for health—they assume that such a measurement will be gamed just like any other.
But, I don't believe that Goodhart's Law applies here because all deployments are preceded by a pull request. The PR review process is, inherently, a social activity. And, in any social activity, there is pressure to act appropriately.
Furthermore, PRs are a system of record. Which means, anyone in the organization can enter any repository and view both PR content and engineering contributions at any time. Nothing about the PR review process is secretive; and so, nothing about deployments can be secretive.
In short, there's really no way to game deployments. Nor is there any compelling reason to do so. After all, if either an individual engineer or an entire team isn't shipping with regularity, there's no punishment to be rendered—only, help to be offered.
Burndown Charts Are Problematic
A burndown chart is a graph that measures the amount of remaining work over time, with the line trending down and to the right. The problem with a burndown chart is that it assumes that the total quantity of work was known ahead of time. But, this assumption is never accurate.
Engineers are absolutely atrocious at estimating work; especially when estimating large swaths of work. As the work unfolds, new problems surface; and, new complexities are revealed. This often changes the understanding of the project scope; which, correctly, leads to more ticket creation.
But, when you create new tickets, you corrupt the burndown chart. And, this makes the burndown chart a poor measure of productivity.
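To illustrate the corruption, consider a toy sketch of the remaining-work series that a burndown chart would plot. The numbers here are fabricated for demonstration only.

```typescript
// A toy illustration of how mid-cycle ticket creation "corrupts" a
// burndown chart. The numbers are fabricated for demonstration only.

interface DayOfWork {
	ticketsClosed: number;
	ticketsDiscovered: number; // New tickets created as scope is clarified.
}

// Compute the remaining-work series that a burndown chart would plot.
function burndownSeries(startingTickets: number, days: DayOfWork[]): number[] {
	let remaining = startingTickets;

	return days.map((day) => {
		remaining += day.ticketsDiscovered - day.ticketsClosed;
		return remaining;
	});
}

// Logs: [ 8, 11, 8, 7 ]
console.log(
	burndownSeries(10, [
		{ ticketsClosed: 2, ticketsDiscovered: 0 }, // 8
		{ ticketsClosed: 1, ticketsDiscovered: 4 }, // 11
		{ ticketsClosed: 3, ticketsDiscovered: 0 }, // 8
		{ ticketsClosed: 2, ticketsDiscovered: 1 }, // 7
	])
);
```

Day two is the interesting one: the engineer closed a ticket and correctly decomposed the remaining complexity into four new tickets; but, the chart trends up, as if the team moved backwards.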
In response, a team might mandate that no new tickets can be moved into the current work cycle. But, all this does is place an artificial constraint on the project and make the work harder to do.
Taking complex work and breaking it down into smaller tasks is what an engineer does. And, an engineer should never be punished for doing their job. But, this is often what a burndown chart does for the team culture.
Instead, new tickets should be celebrated. And shipping those tickets to production should be used as the measure of productivity.
The Five Whys of Deployments
Within a culture of shipping, when a team isn't deploying code with high frequency, we know that something is wrong. But, we don't always know how to help. One benefit of measuring engineering health through deployment frequency is that it gives us a consistent place to start debugging.
In fact, if we use the Five Whys technique in order to uncover problems, we always know what the first question is:
Why isn't (team or engineer) shipping more frequently?
The Five Whys is an investigatory technique in which a root cause is located by iteratively stating a problem and then asking the question, "Why?" This technique forces us to look past superficial symptoms and identify deeper—potentially systemic—issues that lie at the heart of the matter.
For example:
Molly hasn't shipped code to production in 3 days. Why?
Molly's code is complete; and, her PR's been approved; but, she's still waiting on approval from the Quality Assurance (QA) team.
Why is Molly waiting for QA's approval?
Our engineering guidelines state that QA must approve all changes being made to this application.
Why is QA allowed to block deployments in this application?
Before we started using feature flags—before the notion of incremental product development was technically viable for our team—the only option our engineers had was to deploy massive changes all at once. This proved very challenging for our engineers; and, we ended up shipping a lot of bugs to production.
Now that we have feature flags, why is this QA constraint still in place?
Adopting feature flags into our workflow was a gradual process that took place over many months. And, once it finally became part of our team's DNA, we never stopped to reconsider other parts of our development workflow that might be impacted.
As you can see, the Five Whys allowed us to dig deep into the operations of our team. By starting with deployment velocity as a goal, we were able to iteratively peel back the layers until we discovered that the current draconian QA practices were never re-evaluated in the context of more modern development techniques.
When applying the Five Whys, it's helpful to try and identify several different issues at each level. Seeing as there's no obvious way to determine which issues are "root" and which are "incidental", it's not unusual to take a few wrong turns.
For example, it would have been just as easy (and perhaps more natural) to explain the QA latency issue as a head-count concern:
The QA team only has 2 engineers that currently service 7 different teams. They are terribly overwhelmed and have become a throughput bottleneck.
If we'd gone down that path, we may have concluded that the root cause was a budgetary constraint. And yes, increasing the size of the QA team would likely have a positive impact on deployment velocity. But, if we stopped there, we'd have missed the fact that a more impactful root cause was actually a misalignment of best practices.
It's important to remember that the goal of the Five Whys technique is never to administer blame. People want to do the right thing. But, they're often operating within constraints that are beyond their control. The goal of the Five Whys is simply to unblock these people; and help them become their best selves.
Remote Work vs. In-Office Work
Some digital product companies—and the managers therein—believe that oversight comes from seeing "butts in seats". They believe that a physical presence is the only way to drive accountability and to keep honest people honest. And so, they mandate in-office work; and, either limit the amount of time allowed for remote work or ban it altogether.
The reality is, many managers don't actually know how to tell if a team is operating well. And, they believe that if they can just see people working, then they will somehow—through some unknown means—be able to intuit if a business is running smoothly.
This is magical thinking. While innovation can be driven by gut instinct and leadership insights, the health of a business must be driven by numbers. Marketers look at conversion rates; sales associates look at deals closed; CFOs look at burn rates and Annual Recurring Revenue. And—if you've created a culture of shipping—technical leaders can look at deployment rates.
Note: My intention here is not to diminish the role of managers—managers do a heck of a lot more than measure team velocity. But, it is part of what they do; and, it's a part that often proves challenging in a traditional development context.
The exciting thing about this is that deployment rates can be measured from anywhere. In-office, remote, hybrid—it doesn't matter. Shipping is shipping. And, if you can see how often the engineering teams are shipping code, you'll foster an understanding of how the entire EPD (Engineering, Product, Design) organization is performing.
Of course, the velocity of deployments can only be helpful when a team is using feature flags; and, is specifically using them to incrementally build and deploy features. Otherwise, the deployment timelines are too long and too arbitrary to be meaningful. In order to unlock better engineering practices and better management practices, a team must be using feature flags.
Have questions? Let's discuss this chapter: https://bennadel.com/go/4563