Feature Flags Book: Life-Cycle Of A Feature Flag
A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: a complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with a simple system.
— John Gall (Gall's Law)
If you're used to taking a feature entirely from concept to finished code before ever deploying it to production, it can be hard to understand where to start with feature flags. In fact, your current development practices may be so deeply ingrained that the value-add of feature flags still isn't obvious—I know that I didn't get it at first.
To help illustrate just how wonderfully different feature flags are, I'd like to step through the life-cycle of a single feature flag as it pertains to product development. This way, you can get a sense of how feature flags change your product development workflow; and, why this change unlocks a lot of value.
In 2023, I published a book titled, "Feature Flags: Transform Your Product Development Workflow". This book contains everything that I've learned over the last 7 years about integrating feature flags into my product development. But, a static book can only take you so far. In an effort to make the book more interactive, I've created a series of blog posts—one per chapter—that provide a place in which the readers and I can discuss the content. You can purchase the book and / or read a preview of each chapter on the book's mini-site. Feel free to leave a question or a comment down below.
- Of Outages And Incidents
- The Status Quo
- Feature Flags, An Introduction
- Key Terms And Concepts
- Going Deep On Feature Flag Targeting
- The User Experience (UX) Of Feature Flag Targeting
- Types Of Feature Flags
- Life-Cycle Of A Feature Flag
- Use Cases
- Server-Side vs. Client-Side
- Bridging The Sophistication Gap
- Life Without Automated Testing
- Ownership Boundaries
- The Hidden Cost Of Feature Flags
- Not Everything Can Be Feature Flagged
- Build vs. Buy
- Track Actions, Not Feature Flag State
- Logs, Metrics, And Feature Flags
- Transforming Your Company Culture
- People Like Us Do Things Like This
- Building Inclusive Products
- An Opinionated Guide To Pull Requests (PRs)
- Removing The Cost Of Context Switching
- Measuring Team Productivity
- Increasing Agility With Dynamic Code
- Product Release vs. Marketing Release
- Getting From No To Yes
- What If I Can Only Deploy Every 2 Weeks?
- I Eat, I Sleep, I Feature Flag
Reader Comments
I received this question from a reader (which I'll try to answer in a subsequent comment):
Something that I didn't quite grasp from the book was what the evolution of a standard "stuck in the 00s" team might look like. I know you went through a step-by-step plan to find pain points and fix them, but it's hard for me to envision just how quality assurance (QA) and the business check code. Do you still use lower environments at all? If so, how do you use them?
Our process, for example, looks like this:
- Code is merged into develop; the develop CI/CD deploys to the dev environment.
- develop is merged into master; the master CI/CD deploys to the prod environment.
I understand that we would rearrange the steps to unblock the dev. Would master then just be deployed everywhere, and QA and the business would just have the feature "on" in dev first? (I believe I saw a sentence mentioning this in passing.) Then, if they find an issue, do they put in a new issue? Right now, QA works on the same issue as the developer, as a separate task on the issue.
I realize that some of these questions are more organization specific, but I think it could be helpful to get in the muck a bit for those orgs who are really waterfall even though they use agile teams for their waterfall (which is most, I bet).
There's a lot in this question, so let me do a bit of stream of consciousness.
Where I worked, different teams with different strategies had different levels of success in integrating feature flags into their development workflows. It seemed that teams with a more robust QA phase had a harder time adapting to feature flags. Or rather, they had a harder time allowing feature flags to change the course of their development practices.
Before feature flags, all of our teams had a pre-production environment. But, each team was a little different. Some teams would test in this "staging" environment for a short period of time before moving onto production. And, other teams would continually push to "staging" but only occasionally promote those changes to production.
For teams with minor QA practices (admittedly, mine was one of them), once we became comfortable using feature flags, we actually stopped using the staging environment altogether. In our case, we would merge our code into master (via the GitHub PR approval process) and keep the code hidden behind a feature flag. Then, once it was deployed to the production environment, we would enable the feature flag for internal testing and product approval.
For the low-QA teams (like mine), the staging environment became a source of friction that slowed down deployments. It also meant yet another environment that had to be configured (vis-a-vis feature flag targeting). We were happy to stop using staging and start using production exclusively.
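To make "hidden behind a feature flag" a bit more concrete, here's a minimal sketch of what that gate might look like in code. The flag client, flag key, and checkout functions are all hypothetical stand-ins, not any particular vendor's SDK:

```typescript
// Hypothetical flag client -- a stand-in for whatever SDK or home-grown
// service evaluates feature flags at runtime.
interface FlagClient {
	isEnabled(flagKey: string, user: User): boolean;
}

interface User {
	id: string;
	email: string;
}

// The new experience ships to production in the same deployment as the old
// one, but stays dark until the flag is enabled for a given user.
function renderCheckout(flags: FlagClient, user: User): string {
	if (flags.isEnabled("new-checkout-flow", user)) {
		// New code path: only visible once the flag is turned on for this user.
		return renderNewCheckout(user);
	}

	// Old code path: what the general audience continues to see.
	return renderLegacyCheckout(user);
}

function renderNewCheckout(user: User): string {
	return `<new-checkout user="${user.id}" />`;
}

function renderLegacyCheckout(user: User): string {
	return `<legacy-checkout user="${user.id}" />`;
}
```

The point is that both code paths live in master and get deployed together; "releasing" the feature is just flipping the flag, not cutting another deployment.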
Other teams that had a lot of QA process continued to use the staging environment for testing. This was done either by pushing explicit feature branches up to staging; or, by leaning on the fact that the master branch was automatically deployed to staging any time it was updated.
As one of the teams that didn't have as much QA, watching the other teams was interesting. It seemed that this combination of heavily using both staging and feature flags together caused a lot of confusion. There seemed to be no clarity or consistency on what had been deployed to which environment; or, which feature flags were enabled where.
Ironically, the teams with a lot of QA process actually had a spike in incident rates during this transitional period. This was often due to a feature accidentally going to production before it was meant to; or, because someone accidentally enabled a feature flag in the wrong environment.
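As an aside, one way to reduce that "wrong environment" failure mode is to make the environment an explicit input when the flag client is created, rather than something a person selects at release time. This is only a sketch, with a hypothetical client factory and made-up environment variables:

```typescript
type Environment = "staging" | "production";

// Hypothetical factory: each running process is bound to exactly one
// environment at startup, using an environment-scoped SDK key, so a flag
// can only ever be evaluated in the environment the process belongs to.
function createFlagClient(environment: Environment, sdkKey: string) {
	return {
		isEnabled(flagKey: string, userId: string): boolean {
			// A real client would call the flag service with the scoped sdkKey;
			// this sketch only illustrates the shape of the API.
			console.log(`[${environment}] evaluating "${flagKey}" for ${userId}`);
			return false;
		},
	};
}

// The environment comes from deployment configuration, not from a person
// picking an environment in a dashboard at release time.
const flags = createFlagClient(
	(process.env.APP_ENV as Environment) ?? "staging",
	process.env.FLAG_SDK_KEY ?? "",
);

flags.isEnabled("new-checkout-flow", "user-123");
```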
Whenever you change a workflow, there's going to be a few growing pains and some trial-and-error. The more you can reduce the complexity, the fewer things there are that can go wrong.
So, I guess, to address the root of your question (finally), my personal goal would be to stop using the pre-production environment for testing as much as possible. This means fewer moving parts which means a smaller mental model for how the world works. Of course, it also means the biggest change to the workflow.
Based on what you are saying about the QA + business approval being part of the same ticket (identified as sub-tasks), I don't think much of that actually has to change. It just means that some of the steps are now happening in a different environment.
Of course, this is all predicated on the notion that you can successfully gate a change behind a feature flag such that changes to the user experience (UX) won't be seen by the general audience until after QA + business sign-off on the changes. If you can't do that, then you can't go straight to production.
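To illustrate what that gating might look like, here's a rough sketch of a targeting rule that keeps a change visible only to QA and business stakeholders until sign-off. The rule shape, flag key, and email addresses are all hypothetical:

```typescript
// Hypothetical targeting rule: "on" only for the internal testers and
// business stakeholders doing sign-off, and "off" for everyone else, which
// is what lets the change sit safely in production before approval.
interface TargetingRule {
	flagKey: string;
	enabledForEmails: string[];
	rolloutPercentage: number; // 0 until sign-off, then ramped up
}

const newCheckoutRollout: TargetingRule = {
	flagKey: "new-checkout-flow",
	enabledForEmails: [
		"qa-team@example.com",
		"product-owner@example.com",
	],
	rolloutPercentage: 0,
};

// Simplistic evaluation of the rule above. Real flag systems bucket users
// deterministically (for example, by hashing the user ID) instead of rolling
// a random number on every evaluation, as this sketch does.
function isEnabledFor(rule: TargetingRule, userEmail: string): boolean {
	if (rule.enabledForEmails.includes(userEmail)) {
		return true;
	}

	return Math.random() * 100 < rule.rolloutPercentage;
}

console.log(isEnabledFor(newCheckoutRollout, "qa-team@example.com")); // true
console.log(isEnabledFor(newCheckoutRollout, "someone-else@example.com")); // false (0% rollout)
```

Once sign-off happens, rolling out to the general audience is just a targeting change (raising the percentage), not another deployment.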
I will add that our Platform team continues to make heavy use of the pre-production environment. We never came up with strategies to weave feature flags into the platform work, only the product work. As such, the platform team does extensive pre-prod testing to make sure they aren't about to blow up the world with a deployment.
One other thought I had was that, if you have a lot of process around product development, it's always easier to start small. Meaning, instead of implementing a company-wide change, see if you can start with a single team and trial changes to the product development process. This way, you can prove-out the changes (in workflow) without having to worry about every team getting it right.
What's in scope of your platform team? I know that some orgs interpret project/product/platform teams differently.
@Justin,
For us, the platform team worked on all the low-level stuff that revolved around deployment pipelines, Kubernetes (K8s), database availability, S3 bucket policies, EC2 machine types, log aggregation mechanics, metrics aggregation, disaster recovery plans, and the VPN. Basically, they did all the stuff that was kind of "behind the scenes".
So, for example, if they wanted to upgrade Kubernetes or a MongoDB version, they would do that in the pre-production environment first and let it run for a few days to see if anyone complained.