Creating A Poor Man's Exponential Backoff And Retry Algorithm In Legacy Code Using ColdFusion
The other day, I was attempting to implement a backoff and retry algorithm around a remote API call that was failing intermittently due to network connection errors. Normally, I would try to find some sort of abstraction that encapsulated the backoff and retry logic. But, I was working in legacy code and I needed to use a feature-flag to completely (and safely) branch the API request logic. As such, I was struggling to find the right separation of concerns. Eventually, I just built the retry logic right into a feature-flagged method. And, I actually kind of liked the way it turned out. Especially since it didn't involve any math.
When I went to implement the backoff and retry logic, my first thought was to keep track of the backoff duration and then multiply it by some quasi-random value after each failed API call. But, I was having trouble doing all the maths in my head. Did I want to stop after a certain number of attempts? Did I want to stop after a total amount of backoff? Was I waiting long enough? Was I waiting too long? What would the worst-case scenario look like from an experiential standpoint?
For some reason, my brain just wasn't working for me. So, I decided to dismiss any clever approach and just brute-force it. Rather than relying on a formula for calculating the backoff, I just hard-coded the durations. And, it turns out, this approach makes the logic very easy to follow (in my opinion). Furthermore, the collection of backoff durations ends up doubling as the number of iterations in the retry-loop (one attempt per entry), which basically removes all of the complexity.
To see what I mean, here's a mock example of a method - doTheThing() - that has to make an unstable API call. Notice that the API call is being performed inside a for-loop that is iterating over a set of backoff-durations:
<cfscript>

	doTheThing( 4 );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I do the thing that requires a brittle network connection. I will retry failed
	* connections using a reasonable backoff; and, eventually throw an error if none
	* of the retries are successful.
	*
	* @id I am the data thing.
	* @output false
	*/
	public void function doTheThing( required numeric id ) {

		// Rather than relying on the maths to do backoff calculations, this collection
		// provides an explicit set of backoff values (in milliseconds). This collection
		// also doubles as the number of attempts that we should execute against the
		// remote API.
		// --
		// NOTE: Some randomness will be applied to these values at execution time.
		var backoffDurations = [
			1000,
			2000,
			4000,
			8000,
			16000,
			32000,
			64000,
			0 // Indicates that the last timeout should be recorded as an error.
		];

		for ( var backoffDuration in backoffDurations ) {

			try {

				makeApiRequest( id );

				// If we made it this far, the API request didn't throw an error, which
				// means that it was successful; return out of the method and retry-loop.
				return;

			} catch ( any error ) {

				// If the error is a retriable error AND we still have a non-zero backoff
				// to consume, sleep the thread and try the request again.
				if ( isRetriableError( error ) && backoffDuration ) {

					sleep( applyJitter( backoffDuration ) );

				// Otherwise, we can't recover from this error, so let it bubble up.
				} else {

					rethrow;

				}

			}

		}

	}

	/**
	* I apply a 20% jitter to a given backoff value in order to ensure some kind of
	* randomness to the collection of requests that may stampede towards an API.
	*
	* @value I am the backoff time being nudged.
	* @output false
	*/
	public numeric function applyJitter( required numeric value ) {

		// Create a jitter of +/- 20%.
		var jitter = ( randRange( 80, 120 ) / 100 );

		return( fix( value * jitter ) );

	}

</cfscript>
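The mock example calls isRetriableError() without defining it. As a minimal sketch of what such a check might look like, the following assumes that makeApiRequest() throws errors whose "type" identifies network-level failures. The type names below are hypothetical placeholders, not values from any real library; substitute whatever your API client actually throws.
<cfscript>

	/**
	* I determine if the given error is worth retrying.
	*
	* NOTE: The error types listed here are HYPOTHETICAL. They stand in for whatever
	* types your own API client uses to signal a transient network failure.
	*
	* @error I am the caught error being inspected.
	* @output false
	*/
	public boolean function isRetriableError( required any error ) {

		var retriableTypes = [
			"ApiClient.ConnectionFailure",
			"ApiClient.Timeout"
		];

		// arrayFindNoCase() returns the 1-based index of the match, or 0 when the
		// given type isn't in the collection.
		return( arrayFindNoCase( retriableTypes, error.type ) > 0 );

	}

</cfscript>
Keeping this check narrow matters: anything that isn't explicitly listed as retriable (a validation error, for example) is rethrown on the first attempt instead of being retried for two minutes.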
This approach isn't clever. And it's not reusable. And it doesn't have a nice separation of concerns. But, I think it's really easy to follow. And, it's really easy to see how the backoff values can be changed; and, what kind of impact your changes will have. And, I think there's a lot of value in the readability of it.
Now, if this code didn't have to be behind a feature-flag, I probably would have, at least, pushed the retry logic into the lower-level makeApiRequest() call. However, since the makeApiRequest() method was already being consumed by other legacy code, I didn't want to touch it - I wanted [just about] all of the new code to be in a completely separate flow of control that was hidden behind a feature-flag.
Oftentimes, when working with legacy code, you have to make decisions that take safety, time to implement, and other competing priorities into account. And, not every solution is perfect. But, in this case, I think I found a rather enjoyable balance of simplicity and effectiveness. I don't love the separation of concerns; but, I do love how easy it is to look at the collection of backoff durations and immediately get a sense of how long an API call could take; and, how many requests are going to be attempted.
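To make that "at a glance" reading concrete, the totals can be worked out directly from the collection. This is a small sketch that reuses the values from the example and the +/- 20% jitter from applyJitter(); it only counts time spent sleeping, not the time consumed by the API requests themselves. The variable names are just for illustration.
<cfscript>

	backoffDurations = [ 1000, 2000, 4000, 8000, 16000, 32000, 64000, 0 ];

	// Each entry in the collection corresponds to one attempt against the API.
	maxAttempts = arrayLen( backoffDurations ); // 8

	// Total sleep time (in milliseconds) if every attempt fails with a retriable
	// error, before any jitter is applied.
	nominalWait = arraySum( backoffDurations ); // 127000

	// The jitter scales each sleep to between 80% and 120% of its nominal value,
	// so the total is bounded by the same percentages.
	shortestWait = ( nominalWait * 80 / 100 ); // 101600
	longestWait = ( nominalWait * 120 / 100 ); // 152400

	writeOutput( "Attempts: #maxAttempts#, sleep: #shortestWait#ms to #longestWait#ms" );

</cfscript>
In other words, the worst case is 8 requests and roughly 100 to 150 seconds of waiting before the final error bubbles up; trimming or adding a value in the collection changes both numbers in an obvious way.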