Bridging the Sophistication Gap
As you read through the use cases chapter, it may have occurred to you that large, sophisticated companies have different ways of solving the same problems. For example, instead of using feature flags to implement rate limiting, a more sophisticated company might put that logic in an Application Load Balancer or a reverse proxy.
Or, instead of using feature flags to implement IP-blocking, a more sophisticated company might add reactive request filtering to a Web Application Firewall or dynamically update ingress routing rules.
Or, instead of gating code behind a feature flag, a more sophisticated company might adjust the traffic distribution across an array of Blue/Green deployment environments.
Or, instead of using feature flags to adjust log emissions, a more sophisticated company might change the filtering in their centralized log aggregation pipeline.
Large, sophisticated companies can do large, sophisticated things because they have massive budgets and hordes of highly-specialized engineers that are focused on building specialized systems. This allows them to solve problems at a scale that most of us cannot fathom.
But, this difference in relative sophistication isn't a slight against feature flags. Exactly the opposite! The difference is a spotlight that highlights the outsize value of feature flags—that we can use such simple and straightforward techniques to solve the same class of problems at a fraction of the cost and complexity.
Nothing illustrates this as clearly as load testing. When a sophisticated company needs to load test a new feature—that is, to test whether or not the system can handle a large volume of requests in parallel—it sets up a load testing environment.
In order for a load test to be meaningful, the test environment has to mirror the production environment as closely as possible. Which means:
The size and configuration of each underlying test server has to match production (in terms of CPU, RAM, and IOPs allocation).
The horizontal scaling of these test servers has to match production so that they can dynamically spin up in order to accept increasing load.
The size and configuration of each underlying database has to match production (in terms of CPU, RAM, and IOPs allocation) so that they can use the same amount of working memory without having to swap on disk.
The data volume and data diversity within the test databases has to match production so that the table statistics lead to the same query plan executions. But, of course, this data needs to be sanitized for security and compliance reasons; which means, there's probably some scheduled task that is continually keeping the test databases synchronized—and anonymized—with production.
There needs to be some sort of traffic replay that is putting unrelated load on the test servers. This way, the test servers aren't just testing the new feature, they're testing the new feature on top of the normal server traffic.
And, of course, there needs to be something performing the actual load testing of the new feature itself.
As you can imagine, setting up and consuming a test environment like this is very complicated and very expensive (both in time and in money). And, the ugly truth is, no matter how many resources get poured into a testing environment like this, it will forever be a best approximation of what production is actually like.
Because, there's no place like production.
Now, you may not have realized it, but we already talked about load testing in a previous chapter; only, we called it an "incremental roll-out".
Consider for a moment why sophisticated companies jump through so many expensive hoops in order to set up a testing environment: safety. The isolation provided by the testing environment means that—no matter what happens during a load test—the production environment will remain unaffected.
This is exactly what an incremental roll-out does. Only, without the exorbitant cost and complexity. When we gate a feature behind a feature flag, the incremental roll-out of said feature allows us to see if-and-how it might negatively affect production—well before any harm is actually inflicted.
A vanishingly small number of changes to an application actually merit load testing. But, for the sake of discussion, let's say that we're about to release one such change. We can start off by releasing the change to our own user only (with ID: 123):
{
    "really-scary-change": {
        variants: [ false, true ],
        distribution: [ 100, 0 ],
        rule: {
            operator: "IsOneOf",
            input: "userID",
            values: [ 123 ],
            distribution: [ 0, 100 ]
        }
    }
}
With this change, every single user in the application will continue to receive the traditional experience while we—the lone engineer—can go through the application and try out the new experience. There's almost nothing that a single user can do to take down production. But, starting with such a narrow release can certainly give us an early indication of both qualitative issues (ie, does the new workflow feel slow) and quantitative issues (ie, are there any metrics or dashboards that now show an irregularity).
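The chapter keeps the evaluation mechanics implicit; but, it may help to see how a config like the one above could resolve to a variant. The following is only a rough sketch based on the config shape in these examples; the function names are assumptions, not a specific library:

// Rough sketch (not this book's library): resolve a flag config to a variant
// for a given user context, mirroring the config shape shown above.
function evaluateFlag( flag, context ) {
    // If the targeting rule matches this user, use the rule's distribution;
    // otherwise, fall back to the flag's top-level distribution.
    var distribution = ruleMatches( flag.rule, context )
        ? flag.rule.distribution
        : flag.distribution;
    return selectVariant( flag.variants, distribution );
}

// Check the targeting rule against the user context (ex, userID or userEmail).
function ruleMatches( rule, context ) {
    var input = context[ rule.input ];
    switch ( rule.operator ) {
        case "IsOneOf":
            return rule.values.includes( input );
        case "EndsWith":
            return rule.values.some( ( value ) => String( input ).endsWith( value ) );
        default:
            return false;
    }
}

// Pick a variant using a percentage distribution that sums to 100.
function selectVariant( variants, distribution ) {
    var roll = ( Math.random() * 100 );
    var cumulative = 0;
    for ( var i = 0 ; i < variants.length ; i++ ) {
        cumulative += distribution[ i ];
        if ( roll < cumulative ) {
            return variants[ i ];
        }
    }
    return variants[ variants.length - 1 ];
}

With the config above, ruleMatches() only returns true for user 123; everyone else stays pinned to the false variant by the [ 100, 0 ] top-level distribution.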
This narrow testing can last a few minutes; or, it can last a few weeks. It doesn't much matter because there's zero cost associated with your level of caution. No test servers are running. No test databases are running. It's just the production servers and some internal control flow branching.
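And, that "internal control flow branching" is nothing more exotic than an if-statement at the call site. Here's a minimal sketch, assuming the evaluateFlag() helper from the previous sketch, an in-memory flags config, and a user object (all of which are illustrative assumptions):

// Rough sketch of the call site: ask the flag system which variant this user
// should receive, then branch into the new or the traditional code path.
var showNewExperience = evaluateFlag(
    flags[ "really-scary-change" ],
    {
        userID: user.id,
        userEmail: user.email
    }
);

if ( showNewExperience ) {
    // ... run the new, scary workflow ...
} else {
    // ... run the existing, battle-tested workflow ...
}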
And, if all signs look good, you can slowly widen the release. Perhaps moving on to your internal team (with email domain: @example.com):
{
    "really-scary-change": {
        variants: [ false, true ],
        distribution: [ 100, 0 ],
        rule: {
            operator: "EndsWith",
            input: "userEmail",
            values: [ "@example.com" ],
            distribution: [ 0, 100 ]
        }
    }
}
Continue checking the error logs and the dashboards to see how the system is responding. Are there qualitative issues? Are there quantitative issues? And, if all signs look good, you can slowly widen the release. Perhaps to 1% of the general audience:
{
    "really-scary-change": {
        variants: [ false, true ],
        distribution: [ 99, 1 ],
        rule: {
            operator: "EndsWith",
            input: "userEmail",
            values: [ "@example.com" ],
            distribution: [ 0, 100 ]
        }
    }
}
And then 2%. And 3%. And so on. All while watching the system for contraindications. And, at each increment, you can remain confident that what you're seeing is going to be reflective of the production servers because this is the production environment.
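One practical detail that the percentages gloss over: for an incremental roll-out to behave predictably, a given user should land on the same side of the split on every request; and, users already in the roll-out should stay in it as the percentage grows. This book's config examples don't prescribe how that's done; but, a common approach (sketched here as an assumption, not as this book's implementation) is to hash the user ID into a stable 0-99 bucket rather than rolling a random number:

// Rough sketch: hash the user ID into a stable bucket (0-99) so that the same
// user always falls on the same side of the percentage split.
function getBucket( userID ) {
    var key = String( userID );
    var hash = 0;
    for ( var i = 0 ; i < key.length ; i++ ) {
        hash = ( ( ( hash * 31 ) + key.charCodeAt( i ) ) % 100 );
    }
    return hash;
}

// With a distribution of [ 99, 1 ], only bucket 99 gets the new experience.
// Widening to [ 98, 2 ] adds bucket 98; users already enrolled stay enrolled.
function isInRollout( userID, percentage ) {
    return ( getBucket( userID ) >= ( 100 - percentage ) );
}

However the bucketing is implemented, the ramp itself is just a config change applied against real production traffic.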
Congratulations, you just ran a load test safely in production! And, it cost the company nothing.
Now, I'm not saying that feature flags completely replace all of the sophisticated techniques that sophisticated companies use. But, I am saying that, by using feature flags, relatively simple companies can reap many of the same rewards with only a fraction of the cost and complexity.
In this industry, many companies strive to enter the FANG (Facebook, Amazon, Netflix, Google) stratosphere. But, there's nothing quite as destructive to a company as attempting to apply FANG "best practices" with small team resources. Instead, small companies should embrace the size of their team; and, seek out the 80/20 solutions (80% of the value with 20% of the effort).
Feature flags allow small teams to do just that!
Have questions? Let's discuss this chapter: https://bennadel.com/go/4550