Wrapping Database Gateways In A Retriable Proxy For Lock Timeouts In ColdFusion And MySQL
In the past couple of days, I've had a lot of fun - as one does - digging into CFQuery database query timeouts and transaction lock timeouts in ColdFusion and MySQL. And, once I figured out how to manually cause an "InnoDB lock wait timeout" error in a controlled manner, the obvious next step was seeing if I could build some retry mechanics that would catch lock timeout errors and retry the transaction automatically. Of course, so as not to have to build this logic into every single data-access gateway, I wanted to see if I could create a proxy component that would use meta-programming techniques to generically implement lock timeout retries.
To experiment with this generic proxy approach, I first had to create a simple database gateway component in ColdFusion. This one does some simple CRUD (Create, Read, Update, Delete) operations on a "friend" table. But, it also includes a "special method" for setting the groundwork for a lock timeout:
<cfcomponent
output="false"
hint="I provide a data-access layer for the Friend collection.">
<cffunction name="clearFriends" access="public" returntype="void" output="false">
<cfquery name="local.results">
TRUNCATE TABLE
friend
;
</cfquery>
</cffunction>
<cffunction name="createFriend" access="public" returntype="numeric" output="false">
<!--- Define arguments. --->
<cfargument name="name" type="string" required="true" />
<cfargument name="isBFF" type="boolean" required="true" />
<cfargument name="updatedAt" type="date" required="false" default="#now()#" />
<cfquery name="local.results">
INSERT INTO
friend
SET
name = <cfqueryparam value="#name#" cfsqltype="cf_sql_varchar" />,
isBFF = <cfqueryparam value="#isBFF#" cfsqltype="cf_sql_tinyint" />,
updatedAt = <cfqueryparam value="#updatedAt#" cfsqltype="cf_sql_timestamp" />
;
SELECT
( @@Identity ) AS id
;
</cfquery>
<cfreturn results.id />
</cffunction>
<cffunction name="deleteFriend" access="public" returntype="void" output="false">
<!--- Define arguments. --->
<cfargument name="id" type="numeric" required="true" />
<cfquery name="local.results">
DELETE FROM
friend
WHERE
id = <cfqueryparam value="#id#" cfsqltype="cf_sql_integer" />
LIMIT
1
</cfquery>
</cffunction>
<cffunction name="getFriends" access="public" returntype="query" output="false">
<cfquery name="local.results">
SELECT
id,
name,
isBFF
FROM
friend
</cfquery>
<cfreturn results />
</cffunction>
<cffunction name="updateFriend" access="public" returntype="void" output="false">
<!--- Define arguments. --->
<cfargument name="id" type="numeric" required="true" />
<cfargument name="name" type="string" required="true" />
<cfargument name="isBFF" type="boolean" required="true" />
<cfargument name="updatedAt" type="date" required="false" default="#now()#" />
<!---
By default MySQL will wait 50-seconds for a row-lock to timeout. Since I
don't want to wait for that (we're trying to create race-conditions here),
I am using the "innodb_lock_wait_timeout" MySQL session variable in order
to force the local lock timeout to be 2-seconds.
--->
<cfquery name="local.results">
SET SESSION innodb_lock_wait_timeout = 2; -- Is 50 by default.
UPDATE
friend
SET
name = <cfqueryparam value="#name#" cfsqltype="cf_sql_varchar" />,
isBFF = <cfqueryparam value="#isBFF#" cfsqltype="cf_sql_tinyint" />,
updatedAt = <cfqueryparam value="#updatedAt#" cfsqltype="cf_sql_timestamp" />
WHERE
id = <cfqueryparam value="#id#" cfsqltype="cf_sql_integer" />
;
</cfquery>
</cffunction>
<!--- SPECIAL METHOD. ---------------------------------------------------------- --->
<!--- SPECIAL METHOD. ---------------------------------------------------------- --->
<!--- SPECIAL METHOD. ---------------------------------------------------------- --->
<cffunction name="updateFriendSlowly" access="public" returntype="query" output="false">
<!--- Define arguments. --->
<cfargument name="id" type="numeric" required="true" />
<cfargument name="name" type="string" required="true" />
<cfargument name="isBFF" type="numeric" required="true" />
<cfargument name="updatedAt" type="date" required="false" default="#now()#" />
<cfquery name="local.results">
UPDATE
friend
SET
name = <cfqueryparam value="#name#" cfsqltype="cf_sql_varchar" />,
<!---
CAUTION: We are performing a SLEEP() here in order to create a
race-condition for other queries that might be trying to update
this record.
--->
isBFF = SLEEP( <cfqueryparam value="#isBFF#" cfsqltype="cf_sql_integer" /> ),
updatedAt = <cfqueryparam value="#updatedAt#" cfsqltype="cf_sql_timestamp" />
WHERE
id = <cfqueryparam value="#id#" cfsqltype="cf_sql_integer" />
</cfquery>
<cfreturn results />
</cffunction>
</cfcomponent>
As you can see, this component has pretty standard CRUD queries, with the exception of the updateFriendSlowly() method, which performs a SLEEP() during the UPDATE statement. This SLEEP() call is needed in order to hold the transaction lock on the given record open so that subsequent UPDATES will timeout.
You may also notice that I am setting the "innodb_lock_wait_timeout" MySQL server variable to 2-seconds in the normal UPDATE method. Again, this is just done in order to lay the groundwork for the inevitable lock timeout in the demo.
Now that we have our run-of-the-mill database Gateway component, I want to create a proxy - or wrapper - that passes-through all calls, but wraps each call in lock-timeout retry semantics. For this, we could use something like ColdFusion's onMissingMethod() handler. But, I don't love the onMissingMethod() method (for strictly subjective reasons). Instead, I prefer to create proxies with ColdFusion's getFunctionCalledName() function. For me, there's something comforting about knowing that the methods exist, even if it's still just meta-programming at heart.
That said, I created a RetriableGateway.cfc ColdFusion component that accepts a target component and an error code to check (assuming that each database driver will throw a different error code for lock timeout). When the RetriableGateway.cfc component is instantiated, it inspects the target component and, for each public method on the target, it creates a local copy (really just a pointer) to the proxyTemplate() method.
component
output = false
hint = "I provide automatic retry functionality around timed-out transactions."
{
// I initialize the retriable gateway with the given target component.
public any function init(
required any target,
numeric retryCount = 2,
string lockTimeoutErrorCode = "40001"
) {
variables.target = arguments.target;
variables.retryCount = arguments.retryCount;
// The database can throw all kinds of errors. In this case, we only want to
// look at the ones that relate to transactions that fail due to a lock timeout.
// Since this error may manifest differently in different database drivers, we'll
// have to compare the error codes. By default, we'll use the MySQL error code
// "40001", which represents, "ER_LOCK_DEADLOCK".
// --
// Read More: https://dev.mysql.com/doc/refman/5.7/en/error-messages-server.html#error_er_lock_deadlock
variables.lockTimeoutErrorCode = arguments.lockTimeoutErrorCode;
generateProxyMethods();
return( this );
}
// ---
// PUBLIC METHODS.
// ---
// ... methods will be duplicated and injected here ...
// ---
// PRIVATE METHODS.
// ---
// I inspect the target component and create local, public proxy methods that match
// the invocable methods on the target component.
private void function generateProxyMethods() {
// Look for public methods / closures on the target component and create a
// local proxy method for each invocable property. By explicitly stamping out
// clones of the proxy method, we don't have to rely on the onMissingMethod()
// functionality, which I personally feel makes this a cleaner approach.
for ( var publicKey in structKeyArray( target ) ) {
var publicProperty = target[ publicKey ];
if ( isInvocable( publicProperty ) ) {
this[ publicKey ] = proxyTemplate;
}
}
}
// I return the back-off duration, in milliseconds, that should be waited after
// the given attempt has failed to execute a transaction.
private numeric function getBackoffDuration( required numeric attempt ) {
return( 1000 * ( attempt + rand() ) );
}
// I determine if the given value is invocable.
private boolean function isInvocable( required any value ) {
return( isCustomFunction( value ) || isClosure( value ) );
}
// I provide the template for all local proxy methods.
private any function proxyTemplate() {
// For the purposes of the error message, we'll record the duration of the
// attempted proxy execution.
var startedAt = getTickCount();
// Gather the proxy invocation parameters. Since the proxyTemplate() has been
// cloned for each public method on the target, we can get the name of the
// target method by introspecting the name of "this" method.
var methodName = getFunctionCalledName();
var methodArguments = arguments;
for ( var attempt = 0 ; attempt <= retryCount ; attempt++ ) {
try {
return( invoke( target, methodName, methodArguments ) );
} catch ( database error ) {
// If this error was not a lock-timeout error, then rethrow it and let
// it bubble up to the calling context.
if ( error.errorCode != lockTimeoutErrorCode ) {
rethrow;
}
// If this was our last retry attempt on the target method, throw an
// error and let it bubble up to the calling context.
if ( attempt == retryCount ) {
throw(
type = "Database.RetryFailed",
message = error.message,
detail = "The proxy method [#methodName#] could not be successfully executed after [#( retryCount + 1 )#] attempts taking [#numberFormat( getTickCount() - startedAt )#] ms.",
errorCode = error.errorCode,
extendedInfo = serializeJson( duplicate( error ) )
);
}
// Since we're encountering a lock timeout, let's sleep the thread
// briefly and give the underlying database a chance to recover.
sleep( getBackoffDuration( attempt ) );
}
}
// CAUTION: Control flow will never get this far since the for-loop will either
// return early or throw an error by the last iteration.
}
}
As you can see, the RetriableGateway.cfc's proxyTemplate() turns around and uses ColdFusion's invoke() method to call the underlying method on the database Gateway. It simply does this in a try/catch so that it can retry the invoke() if the target gateway throws a lock wait timeout error.
With this RetriableGateway.cfc in hand, I wanted to make sure that it worked as expected in happy-path conditions:
<cfscript>
// Create our MySQL abstraction layer.
friendGateway = new FriendGateway();
// Create our retriable strategy, which will retry transactions if the underlying
// query throws a lock timeout error (rethrows all other errors).
retriableFriendGateway = new RetriableGateway( friendGateway );
// Make sure the basic proxy functionality works.
writeDump( retriableFriendGateway.clearFriends() );
writeDump( retriableFriendGateway.createFriend( "Kim", true ) );
writeDump( retriableFriendGateway.createFriend( "Samantha", true ) );
writeDump( retriableFriendGateway.createFriend( "Amanda", false ) );
writeDump( retriableFriendGateway.createFriend( "Helena", true ) );
writeDump( retriableFriendGateway.deleteFriend( 3 ) );
writeDump( retriableFriendGateway.getFriends() );
</cfscript>
Here, we're just trying the normal CRUD operations on the proxy, making sure they flow through to the underlying FriendGateway.cfc. And, when we run this code, we get the following output:
So far so good - calling the RetriableGateway.cfc works just like calling the FriendGateway.cfc directly.
Now for the fun stuff - forcing a transaction lock wait timeout and seeing if the retry semantics are implemented properly. For this, we have to use the FriendGateway.cfc's "special" update method that uses the SLEEP() function. Then, while one long-running UPDATE is executing in an asynchronous thread, we have to try and perform a conflicting update that will be blocked by the existing lock:
<cfscript>
// Create our MySQL abstraction layer.
friendGateway = new FriendGateway();
// Create our retriable strategy, which will retry transactions if the underlying
// query throws a lock timeout error (rethrows all other errors).
retriableFriendGateway = new RetriableGateway( friendGateway );
// Create the record for which we are going to generate a race-condition.
retriableFriendGateway.clearFriends();
friendID = retriableFriendGateway.createFriend( "Libby", false );
// Now, let's spwan an asynchronous, long-running query that will lock the given
// reocrd, blocking parallel queries from updating it at the same time.
thread
name = "create-race-condition"
id = friendID
{
// CAUTION: This update will run for 20-seconds!
retriableFriendGateway.updateFriendSlowly( id, "Libbster", 20 );
}
// Pause briefly, ensuring the preceeding CFThread has a chance to spawn.
sleep( 1000 );
// Now, let's try to update the record in parallel with the long-running update
// that has already been initiated.
try {
retriableFriendGateway.updateFriend( friendID, "Libbinator", true );
writeDump( retriableFriendGateway.getFriends() );
} catch ( any error ) {
writeDump( error );
}
</cfscript>
As you can see, the long-running UPDATE will block for about 20-seconds, which is longer than the retry mechanics will allow for (2 retries with a random back-off will only wait for a total of about 10-seconds). As such, we should see an error get thrown. And, in fact, when we run this code, we get the following ColdFusion error:
The second UPDATE never completed successfully; but, from the error detail, we can see that it did attempt the invoke() call 3 times and blocked for about 11-seconds.
Now, to make sure that the happy-path works - with retries - let's lower the SLEEP() time to 10-seconds. This way, the subsequent UPDATE won't work on the first attempt; but, it should work on one of the retry attempts. And, in fact, when we run the code again - with a 10-second SLEEP() - we can see the subsequent UPDATE go through:
As you can see, the page blocked for just over 10-seconds. However, the retry logic in the RetriableGateway.cfc was able to eventually execute a successful UPDATE statement.
When I first heard people talk about retrying failed database transactions, I always thought it would be a tremendous amount of work and require a lot of code. But, with the meta-programming capabilities in ColdFusion, I think this kind of behavior can actually be added with a fairly low level of effort. This experiment has given me plenty of existing stuff to noodle on.
Want to use code from this post? Check out the license.
Reader Comments
@All,
After completing this experiment, I realized that the only part of this that pertains to MySQL is the checking of the error type. If I could factor that out, I could make the retry logic a bit more generic / broadly applicable:
www.bennadel.com/blog/3309-creating-a-generic-proxy-for-retry-semantics-in-coldfusion.htm
Seems like an interesting approach.
Ben,
I like your solution, especially the increasing back-off duration.
In his book, "Release It!" (despite the upbeat title, basically a book of programming horror stories), Michael Nygard argues against retrying queries that have timed out based on the idea that they are unlikely to succeed again right away and that could put further stresses on the system (if I remember his reasoning correctly).
My own observations, however, suggest that sometime an immediate retry will likely work and could be effective (though he clearly has vastly more experience than I do).
Are you planning to retry all queries that time out or just some? If some, how are you going to decide which should be retried and which should not?
@Steve,
It's funny you should bring that up -- I was actually just thinking about this over the weekend. More specifically about how retrying queries would dovetail with the concept of Circuit Breakers which Nygard talks about in his book (and yes, it is totally a book of horror stories!!). The basis of the Circuit Breaker is that it can intercept all errors and use them to update the health of the Circuit; but, if a gateway were retrying queries on it's own, would the Circuit Breaker get to know about it?
That said, you could also ask the question about whether or not a "lock wait timeout" is a "true error". Meaning, many of the Circuit Breaker implementations come with a means to differentiate true errors (ex, Network Failure) from expected domain errors like "Not Found" or "Already Exists". Would a wait timeout be considered a true error? Or, an expected domain error?
Honestly, I'm not exactly sure. That said, the Retry algorithm, which retries and then throws another error -- "Database.RetryFailed" -- could definitely be caught by the Circuit Breaker.
Much food for thought!