Enforcing Unique Naming Constraints Using A SERIALIZABLE Transaction In Lucee CFML 5.3.7.47
The other day, I had to write a SQL script to clean up some "dirty data" in our database that violated a unique-naming constraint imposed by the business logic. When I looked at the ColdFusion code that created the dirty data, I noticed that it wasn't using a transaction
tag. Now, as much as I love databases and writing SQL, I'm not super confident when it comes to certain aspects of transactions - namely, isolation levels. As such, I wanted to experiment a little with using transaction isolation in order to enforce unique naming constraints in MySQL 5.7.32 and Lucee CFML 5.3.7.47
UNIQUE KEY
to Enforce Unique Naming Constraints?
Why Not Use a Before we look at transaction isolation, let's address the elephant in the room: if our application needs to enforce unique naming constraints, why not just apply a unique index to the column (or set of columns) in question? And to be fair, that's a totally legitimate solution - a UNIQUE KEY
on the column(s) would enforce unique values in the data-table. In fact, at InVision, we have a UNIQUE KEY
on the email
that users create as part of their login credentials.
But, that approach isn't always possible - at least not with MySQL InnoDB. For example, what if the application's business logic only applies the uniqueness constraint to a subset of records? For example, imagine that the entities within the application can be archived using an isArchived
boolean column; and, imagine that the uniqueness constraint only applies to the active records (ie, those where isArchived=0
). In MySQL - as far as I know - there's no way to define an index that only affects conditional rows within a table.
Another reason that a UNIQUE KEY
may not be desirable is simply when the only value of the index is the uniqueness constraint. Indexes have a cost both in terms of storage and write-performance. That cost is generally worthwhile because of the massive performance gains that a database index can provide during a table scan. But, if the "unique" column isn't being searched (ie, we're never looking up rows based on the "unique" value), then we're not leveraging the main benefit of the index. As such, we'd be incurring all of the cost associated with the index maintenance but reaping little of the benefit. In such cases, the overall cost of the uniqueness constraint may make more sense to be pushed up into the application layer.
ASIDE: The
UNIQUE KEY
on the
One final reason that a UNIQUE KEY
on a column may not be desirable is if the column is too wide for an index. To be honest, I don't understand this aspect of index very well at all; but, from what I can see in the MySQL documentation, some VARCHAR
indexes can only contain up to 255 characters:
The index key prefix length limit is 767 bytes for InnoDB tables that use the
REDUNDANT
orCOMPACT
row format. For example, you might hit this limit with a column prefix index of more than 255 characters on aTEXT
orVARCHAR
column, assuming autf8mb3
character set and the maximum of 3 bytes for each character.
So - at least in some cases - if the column in question is more than 255 characters, a UNIQUE KEY
index may not be sufficient for enforcing unique values. Though, it does appear that there are ways around that; but, again, my understanding here is very limited.
Why Not Use a Distributed Lock to Enforce Unique Naming Constraints?
If we're going to rule-out a UNIQUE KEY
index for the purposes of our discussion, the next option might be to use a distributed lock to synchronize access to the code that mutates the database records. This would totally work; and, in fact, is an approach that I've used in the past. But, distributed locks add complexity and they add another point-of-failure. So, while I used to lean on locks a lot more often, for the last few years I've been trying to emphasize idempotent workflows without distributed locking.
NOTE: A distributed lock is only needed when you're operating more than one application server. If you only have a single ColdFusion server operating on the database, a simple
CFLock
tag would be sufficient for synchronization.
Enforcing Unique Naming Constraints Using SERIALIZABLE Transactions
Now that we've addressed our elephants, let's consider unique naming constraints in the application layer within our business logic. To set the premise for this exploration, imagine that we have a widget
table with the following schema:
CREATE TABLE `widget` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`userID` int(10) unsigned NOT NULL,
`name` varchar(100) NOT NULL,
`createdAt` datetime NOT NULL,
`updatedAt` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `IX_byUser` (`userID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
In this table, the userID
is the "owner" of the widget. And, in order to quickly locate all widgets owned by a particular user, we have an index on the userID
column. The name
column is the column in which we want to enforce unique values; however, we only want to enforce unique values under a given user. So, essentially, we want the (userID
,name
) tuple to be unique across the entire table.
To this end, when a user goes to create a new widget, we want our business logic to look something like this:
- Check to see if there's an existing widget record with the given name (owned by the given user).
- If so, throw an error.
- If not, allow the new widget record to be created.
To help separate out the interesting aspects from the mundane aspects within this exploration, I've created a set of CRUD (Create, Read, Update, Delete) functions for interacting with the widget
table:
<cfscript>
/**
* I create a new widget record for the given user.
*/
public numeric function createWidget(
required numeric userID,
required string name
) {
```
<cfquery result="local.insertResult" datasource="testing">
INSERT INTO
widget
SET
userID = <cfqueryparam value="#userID#" sqltype="integer" />,
name = <cfqueryparam value="#name#" sqltype="varchar" />,
createdAt = UTC_TIMESTAMP(),
updatedAt = UTC_TIMESTAMP()
</cfquery>
```
return( insertResult.generatedKey );
}
/**
* I get all the widgets for the given user.
*/
public query function getWidgetsByUserID( required numeric userID ) {
```
<cfquery name="local.results" datasource="testing">
SELECT
w.id,
w.userID,
w.name
FROM
widget w
WHERE
w.userID = <cfqueryparam value="#userID#" sqltype="integer" />
</cfquery>
```
return( results );
}
/**
* I determine if a widget with the given name already exists under the given user.
*/
public boolean function isWidgetWithNameExists(
required numeric userID,
required string name
) {
```
<cfquery name="local.results" datasource="testing">
SELECT
1
FROM
widget w
WHERE
w.userID = <cfqueryparam value="#userID#" sqltype="integer" />
AND
w.name = <cfqueryparam value="#name#" sqltype="varchar" />
</cfquery>
```
return( !! results.recordCount );
}
</cfscript>
We'll use these CRUD methods to create our test. What we want to try and do is create a race-condition in which two different asynchronous threads attempt to create a new widget with a given name. This would serve to simulate the notorious "double submit" that can occur when a user double clicks on a <form>
's submit button (and there is no client-side code to catch-and-dismiss such an action).
In ColdFusion, this is quite easy to do with the CFThread
tag. In the following test, we're going to generate a random userID
and a unique name
and then try to run the same exact insert workflow in two different asynchronous threads:
<cfscript>
include "./db-functions.cfm";
// We're going to have two CFThread tags race to insert this record.
userID = randRange( 1111, 9999 );
widgetName = "widget-#createUniqueId()#";
// ------------------------------------------------------------------------------- //
// ------------------------------------------------------------------------------- //
thread
name = "competingThreadOne"
userID = userID
widgetName = widgetName
{
transaction isolation = "serializable" {
// MySQL is going to LOCK the index range scanned during this SERIALIZABLE
// transaction, which in our case - with an index on `userID` - is going to
// be all the rows that are owned by the given user.
if ( isWidgetWithNameExists( userID, widgetName ) ) {
throw( type = "NameAlreadyExists" );
}
// Race condition - both competing threads MAY HAVE time to get to this
// point, each thinking that the given widget does NOT EXIST yet.
// --
// NOTE: We can use a sleep() to make sure both competing threads get to
// this point while both transactions are being held open.
// sleep( 2000 );
createWidget( userID, widgetName );
}
}
thread
name = "competingThreadTwo"
userID = userID
widgetName = widgetName
{
transaction isolation = "serializable" {
// MySQL is going to LOCK the index range scanned during this SERIALIZABLE
// transaction, which in our case - with an index on `userID` - is going to
// be all the rows that are owned by the given user.
if ( isWidgetWithNameExists( userID, widgetName ) ) {
throw( type = "NameAlreadyExists" );
}
// Race condition - both competing threads MAY HAVE time to get to this
// point, each thinking that the given widget does NOT EXIST yet.
// --
// NOTE: We can use a sleep() to make sure both competing threads get to
// this point while both transactions are being held open.
// sleep( 2000 );
createWidget( userID, widgetName );
}
}
// ------------------------------------------------------------------------------- //
// ------------------------------------------------------------------------------- //
// Wait for both competing threads to re-join the parent thread.
thread action = "join";
dump(
label = "Widgets by User",
var = getWidgetsByUserID( userID ),
metainfo = false
);
dump(
label = "CFThreads",
var = cfthread
);
</cfscript>
Because both of the CFThread
tag-bodies are executing in parallel, there's a good chance that both threads will run the isWidgetWithNameExists()
method at roughly the same time. As such, they will both return false
; and then both threads will try and proceed to the createWidget()
call at the same time.
Without a parent transaction
, this would lead to "dirty data" in which two, duplicate records would be created. However, in this case, I am wrapping the workflow in a CFTransaction
tag with SERIALIZABLE
isolation. According to the MySQL documentation, this means:
This level is like
REPEATABLE READ
, but InnoDB implicitly converts all plainSELECT
statements toSELECT ... LOCK IN SHARE MODE
ifautocommit
is disabled.... UsingLOCK IN SHARE MODE
sets a shared lock that permits other transactions to read the examined rows but not to update or delete them.... InnoDB locks the index range scanned, using gap locks or next-key locks to block insertions by other sessions into the gaps covered by the range.
This means that the SELECT
statement within our isWidgetWithNameExists()
method call ends up locking all the rows scanned by our index. And, since we have an index on userID
, MySQL locks all the rows with the given userID
value. So, even if both CFThread
tags enter the CFTransaction
tag at the same time, only one of them will obtain the row-level locks on the userID
value. Which, in turn, means that the subsequent call to the createWidget()
method will ultimately fail for one of the asynchronous threads.
Ultimately, this leads to a deadlock for one of the threads, resulting in the following SQL error:
Deadlock found when trying to get lock; try restarting transaction
If unhandled, this would result in a 500 Server Error
for the user, which is a poor user experience (UX). But, we can always add client-side logic to prevent the double-submission in the first place; or, we can add server-side logic that retries the creation workflow (ie, restarts the transaction) and then handles the thrown NameAlreadyExists
custom error more gracefully.
By using the SERIALIZABLE
isolation level in our CFTransaction
tag, we are able to prevent dirty data. But, this comes at a cost - remember, this isolation locks all the rows owned by a given userID
value. Which means, if other asynchronous threads / requests are attempting to run UDPATE
or DELETE
statements on any one of those rows at the same time, they may also run into deadlocks. To cut down on these race-conditions, the logic within your CFTransaction
tags should be as fast as possible and do as little as possible so that the transaction
can be committed as soon as possible thereby freeing up the locks.
In 99% of cases, when I use the CFTransaction
tag, I'm doing it to enforce all-or-nothing updates on the database. Meaning, I'm not using the transaction to enforce uniqueness constraints, I'm using it to create atomic mutations on the data. As such, my mental model for isolation level is somewhat lacking. Hopefully, I haven't said anything here that is blatantly wrong or misleading. If so, please let me know!
REPEATABLE_READ
The Default Isolation Level in MySQL is The way that I understand these isolation levels, if this demo used:
transaction isolation = "repeatable_read" { ... }
Or, just used the CFTransaction
tag without an explicit isolation (allowing MySQL to use the default isolation level):
transaction { ... }
... then we could end up with dirty data. This is because the REPEATABLE_READ
isolation level is not converting our SELECT
statement into a SELECT ... LOCK IN SHARE MODE
statement. As such, our userID
row-scan in isWidgetWithNameExists()
wouldn't end up locking rows, which could allow both INSERT
statements in our createWidget()
call to succeed.
But, again, my mental model for isolation levels in transactions is very poor. It's definitely something I need to work on.
Want to use code from this post? Check out the license.
Reader Comments
@All,
After this post, I wanted to do a quick follow-up post to look specifically at the scope of the row-locking applied by the
SERIALIZABLE
transaction. In this post, I claimed that it was locked-down theuserID
- but, I wanted to actually demonstrate that this was true:www.bennadel.com/blog/4122-exploring-the-scope-of-serializable-transaction-row-locking-in-lucee-cfml-5-3-7-47.htm
This follow-up updates multiples rows across multiple users in order to show that the execution times don't interact.
@All,
Another follow-up on the scope of row-locking - it seems that if a given slice of the index is empty - ie, there are now rows yet associated with a given
userID
(in our demo) - then we can end-up getting deadlocks:www.bennadel.com/blog/4133-the-scope-of-serializable-transaction-row-locking-is-larger-when-rows-dont-yet-exist-in-mysql-5-7-32.htm
This appears to be related to the way in which MySQL has to lock something called
supermum
. I don't really understand what this means; but, I demonstrate the change in the above link.