Dynamically Extending A Long-Lived Distributed Lock With Redis In Lucee CFML 5.3.7.47
At InVision, we use Redis to implement our distributed locks - that is, locks that need to be honored across a number of horizontally-scaled ColdFusion pods. In the vast majority of cases, these locks are short-lived and could likely be removed if we had better practices around idempotent operations. However, one of the distributed locks is very long-lived and creates synchronization around a heavy processing workflow. The duration of this distributed lock poses a significant problem in a context where ColdFusion pods can be killed at any moment, leaving the distributed lock open in an unmanaged state. As such, I wanted to explore the idea of using short-lived distributed locks that can be dynamically extended in Lucee CFML 5.3.7.47.
The typical distributed lock workflow looks something like this:
- Obtain the distributed lock.
- Perform the synchronized work.
- Release the distributed lock.
In our ColdFusion application, distributed locks are implemented using simple Redis SET operations with an NX flavor; meaning, the key is only set if it doesn't already exist. And, when the key is set, it is defined with a TTL, or Time-To-Live, which defines how long the key can exist before the Redis database automatically expunges it from the key-store.
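At the Redis level, that acquisition is just a single command; something like this sketch, in which the key name, value, and 20-minute TTL are purely illustrative:
SET my-lock-name "locked" NX EX 1200
If Redis replies with OK, the key was created and we hold the lock; if it replies with nil, some other pod already holds it.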
In a "happy path" workflow, the TTL doesn't really matter since the ColdFusion application code will explicitly delete the key once the synchronized work is completed. However, if the ColdFusion pod were to be killed unexpectedly during step-2 above, the application won't delete the key; which means, the Redis key will exist until the TTL runs out.
For a long-running task, the TTL on the key might be something high, like 20-minutes. So, if the ColdFusion pod is suddenly rescheduled by Kubernetes (K8), for example, the same synchronized workflow may not be able to start-up again on a different pod for 20-minutes - not until the old Redis key expires.
To hedge against this dynamic nature of a horizontally-scaled system, I want to experiment with keeping the TTL on the distributed lock keys short; and then, pushing out the TTL at various points in the workflow. This would make the distributed lock workflow look more like this:
- Obtain distributed lock (with a short TTL).
- Perform some synchronized work.
- Push out the TTL a small amount.
- Do some more work.
- Push out the TTL a small amount.
- Do some more work.
- Push out the TTL a small amount.
- Do some more work.
- Release the distributed lock.
With this approach, if the pod were to crash or be moved suddenly, the now-unmanaged Redis key will only block further synchronized work for a short period of time (before the TTL runs its course).
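In terms of raw Redis commands, each "push out the TTL" step is just another expiry command against the same key. Conceptually, the lifecycle looks something like this sketch (the demo below actually uses EXPIREAT with an absolute timestamp, which amounts to the same thing):
SET my-dynamic-lock-name "locked" NX EX 60
# ... perform some synchronized work ...
EXPIRE my-dynamic-lock-name 60
# ... do some more work ...
EXPIRE my-dynamic-lock-name 60
# ... do some more work ...
DEL my-dynamic-lock-name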
To see what I mean, I put together a small demo in which I obtain a distributed lock and alternate some sleep() commands and some extendLock() operations. As this ColdFusion code is running, we can then examine the TTL of the Redis key to see how it is changing:
<cfscript>
// Distributed locks can prevent different servers from stepping on each other's
// processing workflows. However, in a horizontally-scaled system, pods can die at
// any time (ex, the pod might crash, Kubernetes might need to suddenly schedule the
// pod on a different node, or Amazon AWS might revoke a spot-instance). As such, an
// "open lock" may be orphaned unexpectedly, leaving the lock OPEN for an
// unnecessarily long period of time. For long-lived locks, this can pose a problem
// because it leaves the system in an unmanaged state. To cope with this, we can use
// a "dynamic lock" that is short-lived (and will quickly expire if left "open");
// but, which can be "extended" from within the "work" algorithm.
withDynamicLock(
"my-dynamic-lock-name",
( extendLock ) => {
loop times = 5 {
// Simulate some "work" inside the distributed lock.
sleep( 2000 );
// Since we know that our work algorithm (in the demo) needs to run in
// 2-second chunks, we can continue to push the lock timeout for another
// 2-seconds at a time.
extendLock( 2 );
}
echo( "Woot! I can haz success! Lock will be released!" );
}
);
// ------------------------------------------------------------------------------- //
// ------------------------------------------------------------------------------- //
/**
* I obtain a distributed lock with the given name and "work" operator. The operator
* will be invoked with a Function that can extend the lock TTL (Time To Live) for a
* given number of seconds. If the lock cannot be obtained, an error is thrown.
*
* @lockName I am the name of the lock being obtained.
* @lockOperator I am the operator to invoke once the lock has been obtained.
*/
public any function withDynamicLock(
required string lockName,
required function lockOperator
) {
return(
withDynamicLockSettings({
name: lockName,
ttl: 60,
operator: lockOperator
})
);
}
/**
* I obtain a distributed lock with the given configuration. The available
* configuration options are as follows. If the lock cannot be obtained, an error is
* thrown.
*
* - name: I am the name of the lock being obtained.
* - ttl: I am the initial time to live (in seconds) of the underlying Redis key.
* - operator: I am the operator to invoke once the lock has been obtained.
*
* @config I am the distributed lock configuration.
*/
public any function withDynamicLockSettings( required struct config ) {
var lockName = config.name;
var lockOperator = config.operator;
// While we'll use the ttlInSeconds for the initial NX/EX creation of the key,
// we'll later use the absolute time when we extend the key life.
var lockTtlInSeconds = config.ttl;
var lockExpiresAt = ( ( getTickCount() / 1000 ) + lockTtlInSeconds );
// NOTE ON REDIS POOL USAGE: Since this lock may be wrapped around a long-running
// process, we don't want to hold the Redis Resource for the entire duration as
// this may end-up exhausting the Redis pool (if the application node has several
// locks running concurrently). As such, we're going to get a new Redis Resource
// from the pool at each step in this process.
// STEP 1: Try to obtain the lock.
// --
// NOTE: In a "production" setting, we might have some sort of exponential back-
// off that waits for the lock for some "timeout" period. However, in this demo,
// to keep things simple, I'm either getting the lock OR FAILING.
var setResult = application.redisPool.withRedis(
( redis ) => {
return(
redis.set(
lockName,
"dynamic redis lock",
"NX",
"EX",
lockTtlInSeconds
)
);
}
);
if ( isNull( setResult ) ) {
throw(
type = "CouldNotObtainDynamicLock",
message = "Failed to obtain dynamic lock",
detail = "Lock [#lockName#] was already in place."
);
}
// STEP 2: Invoke the lock operator.
// --
// At this point, we were able to set the Redis Key, which means that we obtained
// the dynamic lock.
try {
// When we invoke the operator, we need to pass-in a Function that will
// extend the duration of the lock, which is really just extending the TTL of
// the Redis Key under the hood.
var extendLock = function( required numeric additionalTimeInSeconds ) {
application.redisPool.withRedis(
( redis ) => {
lockExpiresAt += additionalTimeInSeconds;
redis.expireAt( lockName, lockExpiresAt );
}
);
};
return( lockOperator( extendLock ) );
// STEP 3: Release the distributed lock.
// --
// No matter what happens, release the dynamic lock at the end. This will get
// called if the lock operator completes successfully or throws an error.
} finally {
application.redisPool.withRedis(
( redis ) => {
redis.del( lockName );
}
);
}
}
</cfscript>
As you can see, when I call withDynamicLock() to obtain the distributed lock, my "work" operator is invoked with an argument, extendLock(). This extendLock() function will push out the TTL of the underlying Redis key by an arbitrary number of seconds. This way, the synchronized algorithm can make informed decisions about how much it wants to hedge against a "sad path" scenario.
Now, if we initiate this ColdFusion code and then look at the state of the Redis database from the terminal, we can see the TTL on the lock key being continually pushed back out.
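One easy way to watch this is to poll the TTL from the command line; something like the following, using the key name from the demo above:
watch -n 1 redis-cli TTL my-dynamic-lock-name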
As you can see, the Redis TTL (Time to Live) on our distributed lock key is holding steady at about 59-seconds. This is because, after every 2 seconds of work that we do (simulated with a sleep() command), we then extend the TTL of the distributed lock key by another 2-seconds. As such, the distributed lock is held-open for the duration of the synchronized work.
And, if the ColdFusion server were to crash at any point, the worst-case scenario is that the Redis key will exist for only about a minute before another pod can pick up and re-run the synchronized algorithm. The trick will be balancing the protection offered by a longer TTL against the possibility that a chunk of work runs long and you don't extend the lock before the short TTL expires. But, at least I think I have a path forward for my Lucee CFML 5.3.7.47 application.
Is This an Architectural "Code Smell"?
One could make the argument that a "distributed lock" is a code smell. And, I suspect that one could - even more easily - argue that a long running distributed lock is an even stankier code smell. And, you're probably right to some degree. I already mentioned that I think a number of our distributed locks could be obviated with better idempotent mechanics; and, a "long running" lock probably indicates a process that should be isolated behind some sort of "worker queue" or some other decoupling technique.
But, let's not let best be the enemy of good. I don't have the people or the time to do a major refactoring of how some long-running work is being performed. So, the best I hope to achieve here is taking a problematic situation and making it more tenable. And, I think that a dynamically extended distributed lock with a short TTL is probably a "good enough" strategy.
Reader Comments
@All,
In the video, I mentioned that I thought calling EXPIREAT on a non-existing key would create a new key. This is not true. From what I can tell (based on experimentation), if you call either EXPIRE or EXPIREAT on a non-existing key, it's basically a no-op and nothing happens. This is great!
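For what it's worth, this behavior is easy to sanity-check from the redis-cli (the key name here is made-up):
127.0.0.1:6379> EXPIRE no-such-key 60
(integer) 0
127.0.0.1:6379> TTL no-such-key
(integer) -2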
@All,
After posting this, one of my engineers - Jan Sedivy - mentioned that he uses a similar technique; only, he uses an asynchronous goroutine to update the TTL while the main process is executing. This sounded very clever, and I wanted to see if I could do the same thing using CFThread:
www.bennadel.com/blog/4006-extending-a-distributed-lock-ttl-using-cfthread-redis-and-lucee-cfml-5-3-7-47.htm
I really like this approach!