Avoiding MySQL max_allowed_packet Errors By Splitting-Up Large Aggregation Queries In ColdFusion

By Ben Nadel

Published 2022-03-28 in ColdFusion, SQL — Comments (4)

Yesterday, I looked at splitting an array up into groups in ColdFusion. I had to do this the other day because I was running into errors when attempting to execute a very large SQL statement in MySQL. To get around the error, I ended up splitting the query logic into chunks, executing the chunks individually, and then reducing the chunk results down to a single value. This felt like a rather elegant solution to a somewhat "dirty problem"; so, I thought it might be worth sharing the approach in ColdFusion (I was writing it in Lucee CFML 5.3.8, but the demo is authored in Adobe ColdFusion 2021).

The code I was writing was for an internal report that had to examine millions of records in the database. Furthermore, the report that I was trying to generate had to perform some fuzzy logic based on some non-indexed values that exist across a set of table that each have several billion records in them.

In short, I had to find a way to take a rather unfortunate set of data constraints and run the queries as efficiently as I could. The approach that I ended up using required me to pull back loads of primary keys from one table, run some logic on them, and then use those primary keys as part of a WHERE IN() clause in another query.

To get a sense of what I mean, a very simplified version of the SQL query looked something like this - in the following demo, I'm using the Range.cfc ColdFusion component that I discussed the other day:

<cfscript>

	// Simulate a long list of primary-key values.
	entryIDs = new Range( 1, 1000 ).toArray();

	// Gather an aggregate across all of the 1,000 IDs in the above list.
	results = queryExecute(
		"
			SELECT
				COUNT( * ) AS commentCount
			FROM
				blog_comment c
			WHERE
				c.blog_entry_id IN ( :idList )
		",
		{
			idList: {
				value: entryIDs.toList(),
				cf_sql_type: "cf_sql_integer",
				list: true
			}
		}
	);

	writeOutput( "Comment count: " & results.commentCount );

</cfscript>

As you can see, I'm using a list of primary keys, presumably gathered from another query, and then I'm using that list as the input for a WHERE IN() clause in a COUNT(*) query.

On some data sets, this report ran fine. However, as we started to run this report on larger and larger datasets, we began to see MySQL errors in the logs:

Packet for query is too large (4,293,573 > 4,194,304). You can change this value on the server by setting the 'max_allowed_packet' variable.

The list of primary keys was getting so long that ColdFusion was generating SQL statements that were too large for the MySQL server settings. At this point, I had two options: I could either completely refactor the approach that I was using, which would be a high level-of-effort; or, I could just add some more duct-tape. I went with the duct-tape option.

Since all I was running was some simple SQL aggregates, the math works out that I can divide-and-conquer the results. Meaning, a COUNT() of the overall dataset should be equal to the sum of the COUNT()s of segments of the overall dataset. And luckily, ColdFusion makes this really easy to do.

In the following code, I'm taking that list of 1,000 primary keys and I'm breaking it up into groups of 100. Then, I'm running the same query multiple times in parallel on the smaller lists using ColdFusion's asynchronous iteration feature:

<cfscript>

	// Simulate a long list of primary-key values.
	entryIDs = new Range( 1, 1000 ).toArray();

	// For the sake of the demo, let's assume that the 1,000 ID list is actually too large
	// for the MySQL max packet size. As such, we're going to split the list into smaller
	// chunks of 100, run the queries in parallel, and the sum the results!
	totalCount = splitArrayIntoGroups( entryIDs, 100 )
		.map(
			( groupIDs ) => {

				// Here, we're gathering the aggregate across a SUBSET of the IDs. Each
				// query will be mapped to an item in the aggregation array.
				var results = queryExecute(
					"
						SELECT
							COUNT( * ) AS commentCount
						FROM
							blog_comment c
						WHERE
							c.blog_entry_id IN ( :idList )
					",
					{
						idList: {
							value: groupIDs.toList(),
							cf_sql_type: "cf_sql_integer",
							list: true
						}
					}
				);

				return( results.commentCount );

			},
			true, // Parallel execution of the iteration.
			4 // Max parallel threads.
		)
		.sum()
	;

	writeOutput( "Comment count: " & totalCount );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I split the given collection into separate arrays, each of which has the given max
	* length.
	*/
	public array function splitArrayIntoGroups(
		required array collection,
		required numeric maxLength
		) {

		var collectionLength = collection.len();
		var groups = [];

		for ( var i = 1 ; i <= collectionLength ; i += maxLength ) {

			// CAUTION: The slice() method is NOT SAFE, and will throw an error if we go
			// out-of-bounds on the given collection. As such, we have to make sure that
			// our group-length argument never goes past the last index.
			var groupLength = min( maxLength, ( collectionLength - i + 1 ) );

			groups.append( collection.slice( i, groupLength ) );

		}

		return( groups );

	}

</cfscript>

Running this ColdFusion code gives us the same exact result as the previous code. Only, it performs the aggregation over several queries which allows the generated MySQL packet size to stay under the limits of the production database.

ASIDE: Apparently, in MySQL, the max_allowed_packet variable is something that can be adjusted on-the-fly using a GLOBAL value. In theory, I could have solved this problem by simply increasing this value and re-running the report. But, I didn't want to start messing with this value since that felt like it was solving the wrong problem. That said, in a pinch, increasing the limit might be the right option for your situation.

Yes, writing ColdFusion code is thrilling - we all know this. But, what I really love about this approach is that it solved the problem using a shift in perspective more than a shift in the code. Meaning, I probably could have solved the problem by adding a lot of code and changing the way that the report was run. But, instead of solving it through a fundamental refactoring, I just stood back and shifted my perspective to one of divide-and-conquer without fundamentally changing anything about the algorithm itself. And, that was a very satisfying moment for me.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/4238

Reader Comments

Chris G Mar 28, 2022 at 8:13 PM

264 Comments

I love stuff like this. The key thing I learned from this article is that the map() function allows parallel processing 🤯

I recently solved a problem that required a "shift in perspective" as well. My boss came to me with a query that wasn't returning the data he expected. I tried to get his query working, but using his approach I could not. So, I sat back, reassessed, then took a different tact. Success 🎉!! It felt great!

Chris G Mar 28, 2022 at 8:15 PM

264 Comments

As a side note... when would 🍆 or 🍑 ever be the desired emoji's for your blog? haha... I just noticed that they're in your quick-picks. You don't run THAT kind of blog, Ben 😜🤣🤣

Ben Nadel Mar 28, 2022 at 8:57 PM

15,996 Comments

@Chris,

As fair warning, the parallel iteration was only added in Adobe ColdFusion in the 2021 release. It's been in Lucee CFML for the last few versions, though.

Re: I have no idea 🤣 sometimes, I guess people just need to know they have the opportunity, even if it's never warranted 🤪

Ben Nadel Apr 22, 2022 at 11:34 AM

15,996 Comments

@All,

One word of caution here - the algorithm that I'm using here to split-up the list of query-params into smaller chunks is using arraySlice(). It turns out, at large scale, the arraySlice() function has some funky performance characteristics which can be greatly improved by altering the underlying implementation details:

www.bennadel.com/blog/4253-arrayslice-has-an-exponential-performance-overhead-in-lucee-cfml-5-3-8-201.htm

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.