Experimenting With Compress() And Uncompress() In MySQL
The other day, I was talking to Alexander Rubin from Percona about some MySQL optimization techniques when he mentioned that MySQL has compress() and uncompress() functions. I had never heard of these before; but, they do exactly what you think they might - compress and uncompress text values. This seems like it may be useful for tables that store a lot of long-text data, such as an error log.
Since this is just an experiment, and the first time that I have ever used these functions, I won't pretend to have any advice on when it's best to use them. But, I can point you to this Percona blog post that talks about some of the considerations in terms of performance and storage.
That said, I'm still interested in trying it out. So, going back to the idea of an error log that has to store lots of stack traces, exception objects, and meta data, it feels like the kind of context in which data-compression would make sense. After all, I'm not querying this data very often; and, when I do query it, I never have to search on those "data" fields - I just have to get them out of the database and display them on the page.
To start exploring, I want to create my data-access object - the ColdFusion component which moves data into and out of the database. In the MySQL documentation, it recommends that the compressed data be stored in a Blob or VarBinary column type. So, here's the data-table that I came up with:
CREATE TABLE `error_log` (
`id` INT(1) UNSIGNED NOT NULL AUTO_INCREMENT,
`message` VARCHAR(100) NOT NULL,
`errorData` BLOB NOT NULL,
`metaData` BLOB NOT NULL,
`ipAddress` VARCHAR(35) NOT NULL,
`createdAt` DATETIME NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Notice that the "errorData" and "metaData" fields are both Blobs.
The Compress() function expects a string. This means that we have to take our server-side data and serialize it before storing it. JSON (JavaScript Object Notation) feels like a natural choice. When we get the data out of the database, however, it comes out as a Blob. If our intention is to deserialize it (which it is), we have to convert it from a byte array into a JSON string. We could do this in ColdFusion or in the database; but, seeing as we're already mucking around with the data in SQL, I figured we could accomplish this with the Convert() function:
<!---
Table structure for the error log:
CREATE TABLE `error_log` (
`id` INT(1) UNSIGNED NOT NULL AUTO_INCREMENT,
`message` VARCHAR(100) NOT NULL,
`errorData` BLOB NOT NULL,
`metaData` BLOB NOT NULL,
`ipAddress` VARCHAR(35) NOT NULL,
`createdAt` DATETIME NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
--->
<cfcomponent
output="false"
hint="I provide a repository for error logging.">
<!--- I get the most recent errors in the error log. --->
<cffunction name="getRecentErrors" access="public" returntype="query" output="false">
<!--- Define arguments. --->
<cfargument name="count" type="numeric" required="false" default="10" />
<!---
When getting the error log items, we have to be sure to decompress the error
data, returning it as a string.
--->
<cfquery name="local.result" datasource="testing">
SELECT
e.id,
e.message,
<!---
The compressed values are being stored as Binary (Blob). To make
these easier to consume, we're going to convert the Blob values to
strings on the way out.
--->
CONVERT( UNCOMPRESS( e.errorData ) USING 'utf8' ) AS errorData,
CONVERT( UNCOMPRESS( e.metaData ) USING 'utf8' ) AS metaData,
e.ipAddress,
e.createdAt
FROM
error_log e
ORDER BY
e.id DESC
LIMIT
<cfqueryparam value="#count#" cfsqltype="cf_sql_integer" />
</cfquery>
<cfreturn result />
</cffunction>
<!---
I log the given error and related data to the error log. The errorData and
metaData will be serialized as JSON (JavaScript Object Notation) when persisted.
As such, you need to make sure the given data can be properly serialized in
ColdFusion using serializeJson().
--->
<cffunction name="logError" access="public" returntype="void" output="false">
<!--- Define arguments. --->
<cfargument name="message" type="string" required="true" />
<cfargument name="errorData" type="any" required="true" />
<cfargument name="metaData" type="any" required="true" />
<cfargument name="ipAddress" type="string" required="true" />
<cfargument name="createdAt" type="date" required="false" default="#now()#" />
<!--- Clean up any data restrictions to prevent Database errors. --->
<cfset message = left( message, 100 ) />
<cfset ipAddress = left( ipAddress, 35 ) />
<!--- Serialize the data structures for storage. --->
<cfset errorData = serializeJson( errorData ) />
<cfset metaData = serializeJson( metaData ) />
<!---
When storing error, we're going to compress the long-text values of the
error-data and the meta-data. These don't get queried a lot, so there
shouldn't be too much of a performance concern.
--->
<cfquery name="local.result" datasource="testing">
INSERT INTO error_log
(
message,
errorData,
metaData,
ipAddress,
createdAt
) VALUES (
<!--- message. --->
<cfqueryparam value="#message#" cfsqltype="cf_sql_varchar" />,
<!--- errorData. --->
COMPRESS( <cfqueryparam value="#errorData#" cfsqltype="cf_sql_longvarchar" /> ),
<!--- metaData. --->
COMPRESS( <cfqueryparam value="#metaData#" cfsqltype="cf_sql_longvarchar" /> ),
<!--- ipAddress. --->
<cfqueryparam value="#ipAddress#" cfsqltype="cf_sql_varchar" />,
<!--- createdAt. --->
<cfqueryparam value="#createdAt#" cfsqltype="cf_sql_timestamp" />
)
</cfquery>
</cffunction>
</cfcomponent>
Notice that we can still use the ColdFusion CFQueryParam tag in conjunction with the Compress() function. No need to trade-in SQL-injection protection for compression!
To test out my new Logger.cfc ColdFusion component, I created a little error log viewer that self-populates. The page shows the most recently logged errors; but, it also creates a new error on every page-load.
<cfscript>
logger = new Logger();
// For the sake of the demo, we're going to create an error on each load of the page.
try {
// Divide by zero - are you mad, man?!
result = 50 / 0;
} catch ( any error ) {
// Log the error - here we can pass in the error object as well as any arbitrary
// data that may be relevent, as part of the metaData object.
logger.logError(
error.message,
error,
{
user = "Ben Nadel",
requestData = getHttpRequestData(),
cgi = cgi
},
cgi.remote_addr
);
}
// Get recent errors to display in our error log. When we get the data back from the
// database, it will still be serialized; we will have to deserialize it ourselves
// when outputtting.
recentErrors = logger.getRecentErrors();
</cfscript>
<!--- Reset the output buffer. --->
<cfcontent type="text/html; charset=utf-8" />
<cfoutput>
<!doctype html>
<html>
<head>
<meta charset="utf-8" />
<title>
Experimenting With MySQL Compress() And Uncompress() Methods In ColdFusion
</title>
</head>
<body>
<h1>
Experimenting With MySQL Compress() And Uncompress() Methods In ColdFusion
</h1>
<table width="100%" border="1" cellpadding="5" cellspacing="1">
<thead>
<th>
ID
</th>
<th>
Created
</th>
<th>
IP Address
</th>
<th>
Message
</th>
<th width="50%">
Error Data
</th>
<th width="50%">
Meta Data
</th>
</thead>
<tbody>
<cfloop query="recentErrors">
<tr valign="top">
<td>
#recentErrors.id#
</td>
<td style="white-space: nowrap ;">
#dateFormat( recentErrors.createdAt, "Mmm d" )# at
#timeFormat( recentErrors.createdAt, "HH:mm:ss" )#
</td>
<td>
#recentErrors.ipAddress#
</td>
<td style="white-space: nowrap ;">
#htmlEditFormat( recentErrors.message )#
</td>
<td>
<!--- Deserialize the data. --->
<cfdump
var="#deserializeJson( recentErrors.errorData )#"
expand="false"
/>
</td>
<td>
<!--- Deserialize the data. --->
<cfdump
var="#deserializeJson( recentErrors.metaData )#"
expand="false"
/>
</td>
</tr>
</cfloop>
</tbody>
</table>
</body>
</html>
</cfoutput>
When I catch the error, I am just passing it off to the Logger.cfc instance that subsequently serializes it as JSON (JavaScript Object Notation). I should mention that I have heard that ColdFusion sometimes has trouble serializing exception objects. This is something that I have heard in passing and my not be true; or may no longer be relevant in recent releases of ColdFusion. In any case, this demo works as expected (running ColdFusion 10).
Compression seems like a really interesting aspect of MySQL. It's probably not appropriate for VARCHAR fields, which does limit the number of places that I could even consider using it. But, I do have some data tables that just warehouse large chunks of data (ex, error log); and this seems like it could be an possible win.
Real-World Experiment
In the Percona blog post, linked above, the author mentions that you can run a test on the table to see what kind of storage gains you might see:
Before going ahead with compression I usually run some checks to see how much compression will benefit you. SELECT AVG(LENGTH(body)) FROM data vs. SELECT AVG(LENGTH(COMPRESS(body))) FROM data gives good data point. However it may not present you with true data size and performance gains, especially for Innodb tables.
I ran this test on the exception column from one of my production error logs and I got this result:
- Average Length: 11,854
- Average Length w/ Compression: 1,733
As you can see, that's a good deal of compression. Not quite an order of magnitude; but moving in that direction.
Want to use code from this post? Check out the license.
Reader Comments
I was excited to read your post. I hadn't heard of UN/COMPRESS either. Looking forward to seeing timings to see what the impact is.
Awesome find. We've got a text heavy app that is going to be expanded so I may just need to implement this.
It will be interesting. In the near future, I'm looking to break out non-core stuff like error/event logging to its own datasource/database. Will be a good time to start digging into the practicalities of this more closely.