MySQL GROUP_CONCAT() Fails Silently When It Hits Its Size Limit
The other day, we ran into interesting little problem. We were using MySQL's GROUP_CONCAT() function to turn an aggregate list into a comma-delimited list. But things weren't working. No errors - just confusing behavior. After logging a lot of output, we realized that the GROUP_CONCAT() function was truncating its output without raising an exception.
To see what I mean, take a look at this demo. Here, we're going to create a fake transaction log that associates a transaction ID with a day of the week (bear with me on this silly example). Then, we're going to report on those transactions by grouping the data by the day of the week and collapsing the relevant transaction IDs into a single list.
<!---
First, let's populate the audit-log for testing. Here, we're doing to store the day
of the week and the transaction ID that took place.
NOTE: This is a really trite example, but I need _something_ to put in the table.
--->
<cfquery name="populate" datasource="testing">
TRUNCATE TABLE audit_log;
<cfset transactionIndex = 100000 />
<!--- For each day of the week, add 1,000 transactions. --->
<cfloop index="dayIndex" from="1" to="7" step="1">
<cfloop index="i" from="1" to="1000" step="1">
INSERT INTO audit_log
(
dayOfWeek,
transactionID
) VALUES (
#dayIndex#,
#( ++transactionIndex )#
);
</cfloop>
</cfloop>
</cfquery>
<!---
Now, let's report on the transactions that have taken place. Grouping the log by the
day of the week, we're going to concatenate the transactionIDs into a grouped-list.
--->
<cfquery name="report" datasource="testing">
SELECT
dayOfWeek,
<!--- Collapse the transaction IDs into a comma-delimited list. --->
GROUP_CONCAT( transactionID ) AS transactionList
FROM
audit_log
GROUP BY
dayOfWeek
ORDER BY
dayOfWeek ASC
</cfquery>
<!---
Output report aggregates.
NOTE: Since we added 1,000 transactions for each day (at the top of demo), each
listLen() should report 1,000.
--->
<cfoutput query="report">
Day Of Week (#report.dayOfWeek#): #listLen( report.transactionList )#<br />
</cfoutput>
As you can see from the INSERT statements, each day of the week should have 1,000 transaction IDs associated with it. And yet, when we run the above code and output the report, we get the following:
Day Of Week (1): 147
Day Of Week (2): 147
Day Of Week (3): 147
Day Of Week (4): 147
Day Of Week (5): 147
Day Of Week (6): 147
Day Of Week (7): 147
Here, each day of the week is reporting only 147 transactions, not the full 1,000 that we inserted. The problem is that GROUP_CONCAT() has a native default limit of 1,024 characters (which is further constrained by the maximum size of the packet that MySQL can return). Ultimately, though, the real problem is not that GROUP_CONCAT() has a limit, it's that it failed silently. Instead of throwing a SQL error, it just truncated the value. And, as you can imagine, if the truncation happens mid-transactionID, we're not just dealing with missing data, we're now dealing with false data.
It's easy enough to refactor the algorithms so that they don't use GROUP_CONCAT(). But, I figured I'd share this since we were banging our heads against the wall for a while trying to figure out what was going on. In the long run, I'm not sure that I can think of a great use-case for GROUP_CONCAT(); and these limitations definitely strengthen that perception.
Want to use code from this post? Check out the license.
Reader Comments
Hi Ben,
I like group concat but people do go against it. I think because its not in the 'standard' sql.
Its quite neat/convenient in more complicated queries where you want to do a whole load of stuff and get somethings summarised. For example when I know there are only going to be a few records.
SELECT loads,of,fields, Group_concat(student.firstname) as students
from loads of tables
... more sql
loads,of,fields, 'Anne,Bob,Colin'
I know you could do that in different ways but group_concat is convenient and perhaps easier to read? for a small group.
@Adam,
I think it is a cool function. And, when I first learned about it, I was really excited since so much of programming is related to gathering groups of IDs :D I guess my biggest gripe is simply that it didn't throw some sort of error - that is just silently truncated the results :(
Yes, think your right. It looks like you can tweak that length, http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_group_concat_max_len
but it seems better throw an exception than fail or truncate silently
@All,
Unfortunately, it looks like MySQL's GROUP_CONCAT() is even more problematic and fails even more mysteriously when it's used in conjunction with a UNION or UNION ALL clause:
www.bennadel.com/blog/2670-mysql-group-concat-has-buggy-interaction-with-union-clauses.htm
This erratic behavior makes it unusable.
I think you're possibly blaming the tool rather than the workman here, Ben. As has been pointed out, it's documented as working the way it does (whether that is ideal or not is another thing). So it's not *failing*. It's doing exactly what it is spec'ed to do.
I see what you're saying here is a failure is *similar* to row-limiting a result set and then being vexed that an exception isn't being thrown if there are indeed more rows that would have been returned if the row-limit wasn't in place.
I would imagine the length-limit is in place because the operation is probably an expensive one, and it's less than idea l to it beyond a point (like limiting list sizes in a WHERE IN clause). It might be worth googling whether it's known to be a function to be avoided for this reason? (I have not done so).
Your follow-up blog article sounds more interesting though.
--
Adam
@Adam,
Totally agree - the function is working exactly as it is documented. My gripe is simply that the implementation seems sub-optimal. Using this function was a poor choice, which we unfortunately didn't catch until it was in production. And, since I've never used it before, it didn't set off any red-flags when I was doing code-review.
I precede all my uses of GROUP_CONCAT with
<cfquery datasource="test">
SET @@group_concat_max_len = 9999999;
</cfquery>
It's a great tool for de-normalising data, such as when populating a search index.
@Julian,
thanks for mentioning the SET @@group_concat_max_len = 9999999;
I had this idea before, but I am not sure about the following:
It's obvious, when you precede all the GROUP_CONCAT queries with that group_concat_max_len QUERY, the lenght limit is omitted. Great. But do you know for how long this setting is keept?
Is it only for the very next cfquery? Or is it keept, e.g. until mysql server is restarted? or only for the user used in the cfquery? only for the application it is defined within or for all applications using this datasource?
I did not find any documentation about this, and I wanted to share this thought, maybe it is interesting for somebody out there ;-)
@Raffael
Using @@ before the variable in the SET statement restricts the change to the current database connection rather than overriding the global server setting. How long the connection will last depends on your settings, but I think the default is normally 20 minutes for Adobe ColdFusion.
If you want to override the setting permanently then you could do that in your my.conf file.