Running MySQL Compress() And Uncompress() Compatible Methods In ColdFusion

By Ben Nadel

Published 2016-09-26 in ColdFusion, SQL — Comments (1)

A couple of weeks ago, Meghana asked me about implementing MySQL's Compress() and Uncompress() algorithms in Java. In the past, I've looked at using Compress() and Uncompress() in MySQL; but, I've never thought about actually trying to implement it in code. As such, I thought it would be a fun experiment. Now, I was specifically asked about implementing this in Java; but, since ColdFusion is built on top of Java, it seems like a ColdFusion-based exploration should cover both use-cases.

From the information I've been able to find, it seems that MySQL (at least by default) uses the ZLIB compression library for the Compress() and Uncompress() function implementations. But, it doesn't just store the raw ZLIB compression result - it stores both the compressed data and the length of the original input in a single binary value.

The binary representation of the original input length is stored in low-byte-first order. This mean, at least in Java - which represents Integers using Big Endian - we have to reverse the order of the bytes before we prepend them to the compressed data. Which, of course, means that when we parse this data, we'll have to reverse the bytes again in order to determine the length of the original input.

Now, according to the MySQL documentation, the compressed value is appended with a "." (period) to prevent accidental white-space trimming when stored in a VARCHAR (or related) field. I was not, however, able to see this in any of my testing. I assume that I am misunderstanding the documentation. As such, it is very possible that my ColdFusion-based implementation of the Compress() and Uncompress() functions are incomplete.

That said, I was able to get something to work. In order to test it, I took several values through the ColdFusion-based compression life-cycle. Then, I compared the resultant binary values to the binary values produced by MySQL's Compress() function:

<!---
	Set up the test values that we will put through the ColdFusion-based compress() and
	uncompress() life-cycle.
	--
	CAUTION: According to the MySQL documentation, it appears that an extra "." may be
	placed at the end of the values to prevent space-trimming; however, I was never able
	to reproduce this edge-case. As such, I am not entirely sure that this case is
	properly handled in the ColdFusion code.
--->
<cfset valueA = "" />
<cfset valueB = "." />
<cfset valueC = "Hello world!" />
<cfset valueD = " give me space " />
<cfset valueE = repeatString( " ", 2000 ) />
<cfset valueF = repeatString( "blam", 111222 ) />


<!--- Create the ColdFusion-based, MySQL-compatible compression component. --->
<cfset compressor = new MySqlCompressor() />

<!--- Use ColdFusion to COMPRESS the test values (creates binaries values). --->
<cfset cfCompressedA = compressor.compress( valueA ) />
<cfset cfCompressedB = compressor.compress( valueB ) />
<cfset cfCompressedC = compressor.compress( valueC ) />
<cfset cfCompressedD = compressor.compress( valueD ) />
<cfset cfCompressedE = compressor.compress( valueE ) />
<cfset cfCompressedF = compressor.compress( valueF ) />

<!--- Use ColdFusion to UNCOMPRESS the test values. --->
<cfset cfUncompressedA = compressor.uncompress( cfCompressedA ) />
<cfset cfUncompressedB = compressor.uncompress( cfCompressedB ) />
<cfset cfUncompressedC = compressor.uncompress( cfCompressedC ) />
<cfset cfUncompressedD = compressor.uncompress( cfCompressedD ) />
<cfset cfUncompressedE = compressor.uncompress( cfCompressedE ) />
<cfset cfUncompressedF = compressor.uncompress( cfCompressedF ) />

<!---
	Assert that the full-life-cycle Uncompressed values match the original input
	values. This ensures that ColdFusion can figure out how to convert to and from the
	compressed format (from ColdFusion's perspective).
--->
<cfif (
	( valueA neq cfUncompressedA ) ||
	( valueB neq cfUncompressedB ) ||
	( valueC neq cfUncompressedC ) ||
	( valueD neq cfUncompressedD ) ||
	( valueE neq cfUncompressedE ) ||
	( valueF neq cfUncompressedF )
	)>

	ColdFusion's local compression life-cycle failed.
	<cfabort />

</cfif>


<!---
	If we've made it this far, the ColdFusion code thinks that it can compress and
	uncompress values; however, we don't yet know if it is actually compatible with
	MySQL's COMPRESS() and UNCOMPRESS() functions. To test this, we'll let MySQL
	compress the same values and then compare the binaries.
--->
<cfquery name="mySql" datasource="testing">
	SELECT
		COMPRESS( <cfqueryparam value="#valueA#" cfsqltype="cf_sql_varchar" /> ) AS compressedA,
		COMPRESS( <cfqueryparam value="#valueB#" cfsqltype="cf_sql_varchar" /> ) AS compressedB,
		COMPRESS( <cfqueryparam value="#valueC#" cfsqltype="cf_sql_varchar" /> ) AS compressedC,
		COMPRESS( <cfqueryparam value="#valueD#" cfsqltype="cf_sql_varchar" /> ) AS compressedD,
		COMPRESS( <cfqueryparam value="#valueE#" cfsqltype="cf_sql_varchar" /> ) AS compressedE,
		COMPRESS( <cfqueryparam value="#valueF#" cfsqltype="cf_sql_varchar" /> ) AS compressedF
	;
</cfquery>

<!---
	MySQL has returned Blob (binary) values. Now, we just need to see if the binary
	values returned by MySQL match the binary values returned by the ColdFusion-based
	compressor. Since we can't compare the binary values directly, we'll compare the
	HEX-encoding of both sets of values.
--->
<cfif (
	( binaryEncode( cfCompressedA, "hex" ) neq binaryEncode( mySql.compressedA, "hex" ) ) ||
	( binaryEncode( cfCompressedB, "hex" ) neq binaryEncode( mySql.compressedB, "hex" ) ) ||
	( binaryEncode( cfCompressedC, "hex" ) neq binaryEncode( mySql.compressedC, "hex" ) ) ||
	( binaryEncode( cfCompressedD, "hex" ) neq binaryEncode( mySql.compressedD, "hex" ) ) ||
	( binaryEncode( cfCompressedE, "hex" ) neq binaryEncode( mySql.compressedE, "hex" ) ) ||
	( binaryEncode( cfCompressedF, "hex" ) neq binaryEncode( mySql.compressedF, "hex" ) )
	)>

	ColdFusion blobs DO NOT MATCH MySQL blobs.
	<cfabort />

</cfif>


<!---
	If we made it this far, we know that the compressed binaries produced by ColdFusion
	and MySQL match. This proves that they compress with compatible algorithms. And,
	since we also know that ColdFusion can uncompress its own binaries, it follows
	logically that it should be able uncompress in a way that is compatible with MySQL
	as well.
--->
Done! All values match! All life-cycles are compatible.

As you can see, I'm running .compress() and .uncompress() in ColdFusion to make sure that ColdFusion's implementation is functional in both directions. But, of course, this doesn't demonstrate that it's actually compatible with MySQL's implementation. That's why I then use MySQL to convert the inputs to binary for a hex-based comparison to ColdFusion's binaries. By the law of substitution, if ColdFusion and MySQL produce the same binaries; and, if ColdFusion can properly uncompress its own binaries; then, it logically follows that ColdFusion should be able to uncompress MySQL's binary values.

When we run this ColdFusion code, we get the following page output:

Done! All values match! All life-cycles are compatible.

Awesome sauce! It seems to work as expected. Here's my ColdFusion implementation of the MySQL Compress() and Uncompress() functions:

component
	output = false
	hint = "I provide MySQL compatible compress() and decompress() methods."
	{

	/**
	* I initialize the compressor service.
	*
	* @bufferSize I am the size of the buffer to use during the compression actions.
	* @output false
	*/
	public any function init( numeric bufferSize = 1024 ) {

		// This value acts a default for the buffer size and can be overridden in the
		// individual inflation and deflation method invocations.
		DEFAULT_BUFFER_SIZE = bufferSize;

		return( this );

	}


	// ---
	// PUBLIC METHODS.
	// ---


	/**
	* I compress the given string, returning the MySQL representation of the compressed
	* binary. The binary value contains both the original length of the input and the
	* compressed data, which is how MySQL stores it.
	*
	* @input I am the string value being compressed.
	* @output false
	*/
	public binary function compress( required string input ) {

		// If the input is empty, return an empty binary - we don't bother storing the
		// length for empty strings.
		if ( ! len( input ) ) {

			return( binaryNew( 0 ) );

		}

		var lengthBytes = lengthToBinary( len( input ) );
		var compressedBytes = zlibCompress( input );

		return( binaryConcat( lengthBytes, compressedBytes ) );

	}


	/**
	* I extract the compressed bytes from the given MySQL representation of the
	* compressed value (which combines both the length of the original input and the
	* compressed data).
	*
	* CAUTION: This method does not take into account the possible empty byte-array
	* if the original input was an empty string. A non-zero length is assumed.
	*
	* @input I am the MySQL representation of the compressed value.
	* @output false
	*/
	public binary function getCompressedBytes( required binary input ) {

		// Slice off 5 -> N bytes.
		return( binarySlice( input, 5, ( arrayLen( input ) - 5 ) ) );

	}


	/**
	* I extract the original-length bytes from the given MySQL representation of the
	* compressed value (which combines both the length of the original input and the
	* compressed data).
	*
	* CAUTION: This method does not take into account the possible empty byte-array
	* if the original input was an empty string. A non-zero length is assumed.
	*
	* @input I am the MySQL representation of the compressed value.
	* @output false
	*/
	public binary function getLengthBytes( required binary input ) {

		// Slice off 1 -> 4 bytes.
		return( binarySlice( input, 1, 4 ) );

	}


	/**
	* I uncompress the given MySQL representation of the compressed value, returning
	* the original input string.
	*
	* @input I am the MySQL binary representation of the compressed value.
	* @output false
	*/
	public string function uncompress( required binary input ) {

		// If the input binary is an empty byte-array, it means that the original input
		// was an empty string. Just return the empty string.
		if ( ! arrayLen( input ) ) {

			return( "" );

		}

		return( zlibUncompress( getCompressedBytes( input ) ) );

	}


	/**
	* I return the length of the original input string for the given MySQL representation
	* of the compressed value.
	*
	* @input I am the MySQL binary representation of the compressed value.
	* @output false
	*/
	public numeric function uncompressedLength( required binary input ) {

		// If the input binary is an empty byte-array, it means that the original input
		// was an empty string. Just return zero.
		if ( ! arraylen( input ) ) {

			return( 0 );

		}

		return( lengthFromBinary( getLengthBytes( input ) ) );

	}


	// ---
	// PRIVATE METHODS.
	// ---


	/**
	* I concatenate the given binary values (in order), returning the resultant binary.
	*
	* @binaryA I am the first binary value.
	* @binaryB I am the second binary value.
	* @output false
	*/
	private binary function binaryConcat(
		required binary binaryA,
		required binary binaryB
		) {

		var length = ( arrayLen( binaryA ) + arrayLen( binaryB ) );

		var byteStream = createObject( "java", "java.io.ByteArrayOutputStream" )
			.init( javaCast( "int", length ) )
		;

		byteStream.write( binaryA );
		byteStream.write( binaryB );

		return( byteStream.toByteArray() );

	}


	/**
	* I create a new binary value (byte array) of the given length, filling it with the
	* given byte initializer.
	*
	* @length I am the length of the new binary value.
	* @fillByte I am the byte value used to fill the new byte array.
	* @output false
	*/
	private binary function binaryNew(
		required numeric length,
		numeric fillByte = 0
		) {

		var bytes = [];

		if ( length ) {

			arrayResize( bytes, length );
			arraySet( bytes, 1, length, fillByte );

		}

		return( javaCast( "byte[]", bytes ) );

	}


	/**
	* I reverse the given binary value (byte array), returning a new binary value.
	*
	* @input I am the binary value being reversed.
	* @output false
	*/
	private binary function binaryReverse( required binary input ) {

		var result = [];
		var resultLength = arrayLen( input );

		arrayResize( result, resultLength );

		for ( var i = 1 ; i <= resultLength ; i++ ) {

			result[ resultLength - i + 1 ] = input[ i ];

		}

		return( javaCast( "byte[]", result ) );

	}


	/**
	* I slice out a portion from the given binary value, returning a new binary value.
	*
	* @input I an the binary value (byte array) being sliced.
	* @index I am the ONE-BASED index at which to start slicing.
	* @length I am the number of bytes to slice.
	* @output false
	*/
	private binary function binarySlice(
		required binary input,
		required numeric index,
		required numeric length
		) {

		return( javaCast( "byte[]", arraySlice( input, index, length ) ) );

	}


	/**
	* I convert the given binary value to a number.
	*
	* @input I am the binary value representation of a number.
	* @output false
	*/
	private numeric function binaryToInt( required binary input ) {

		var result = createObject( "java", "java.math.BigInteger" )
			.init( input )
			.intValue()
		;

		return( result );

	}


	/**
	* I convert the given number to a binary value.
	*
	* @input I am the number being converted to a binary representation.
	* @output false
	*/
	private binary function intToBinary( required numeric input ) {

		var result = createObject( "java", "java.math.BigInteger" )
			.valueOf( javaCast( "int", input ) )
			.toByteArray()
		;

		return( result );

	}


	/**
	* I get the length of the original input string based on the given length bytes
	* extracted from the MySQL representation.
	*
	* @input I am the length bytes portion of the MySQL representation.
	* @output false
	*/
	private numeric function lengthFromBinary( required binary input ) {

		// MySQL stores the length bytes in a low-byte-first order. As such, we have to
		// reverse the bytes before converting them to a number representation.
		return( binaryToInt( binaryReverse( input ) ) );

	}


	/**
	* I get the binary representation of the given length for use with the compressed
	* MySQL representation.
	*
	* @length I am the length for which we are getting bytes.
	* @output false
	*/
	private binary function lengthToBinary( required numeric length ) {

		// MySQL stores the length bytes in a low-byte-first order. As such, we have to
		// reverse the byte representation of the given number.
		var lengthBytes = binaryReverse( intToBinary( length ) );

		// Make sure the resultant value is at least 4-bytes.
		var bytesForStorage = binaryConcat( lengthBytes, binaryNew( 4 ) );

		// Return the first 4-bytes.
		return( binarySlice( bytesForStorage, 1, 4 ) );

	}


	/**
	* I compress the given input using the ZLIB compression library.
	*
	* @input I am the string being deflated.
	* @bufferSize I am the size of the buffer to use while deflating.
	* @output false
	*/
	private binary function zlibCompress(
		required string input,
		numeric bufferSize = DEFAULT_BUFFER_SIZE
		) {

		var buffer = binaryNew( bufferSize );
		var inputBytes = charsetDecode( input, "utf8" );
		var byteStream = createObject( "java", "java.io.ByteArrayOutputStream" ).init();
		var deflater = createObject( "java", "java.util.zip.Deflater" ).init();

		deflater.setInput( inputBytes );
		deflater.finish();

		var byteCount = deflater.deflate( buffer );

		// Continue to pump the deflated bytes into the byte stream while there is still
		// input data to be deflated.
		while ( byteCount ) {

			byteStream.write( buffer, javaCast( "int", 0 ), javaCast( "int", byteCount ) );

			byteCount = deflater.deflate( buffer );

		}

		deflater.end();

		return( byteStream.toByteArray() );

	}


	/**
	* I uncompress the given binary using the ZLIB compression library.
	*
	* @input I am the binary value being inflated.
	* @bufferSize I am the size of the buffer to use while inflating.
	* @output false
	*/
	private string function zlibUncompress(
		required binary input,
		numeric bufferSize = DEFAULT_BUFFER_SIZE
		) {

		var outputBuffer = binaryNew( bufferSize );
		var byteStream = createObject( "java", "java.io.ByteArrayOutputStream" ).init();
		var inflater = createObject( "java", "java.util.zip.Inflater" ).init();

		inflater.setInput( input );

		var byteCount = inflater.inflate( outputBuffer );

		// Continue to pump the inflated bytes into the byte stream while there is still
		// input data to be inflated.
		while ( byteCount ) {

			byteStream.write( outputBuffer, javaCast( "int", 0 ), javaCast( "int", byteCount ) );

			byteCount = inflater.inflate( outputBuffer );

		}

		inflater.end();

		return( byteStream.toString( javaCast( "string", "UTF-8" ) ) );

	}

}

This ColdFusion component encapsulates three different areas of responsibility:

It understands the MySQL representation of the compressed data (and how to create and parse it).
It understands how to manipulate binary values / byte arrays.
It understands how to execute ZLIB inflation and deflation workflows.

Ideally, this should be three different components, with the latter two broken out into some sort of "Binary Utilities" and "ZLib Proxy" implementations. And, if this were a full-on application, this is exactly how I'd package them; however, for the sake of simplicity in a simple demo, it made more sense to just keep everything in one easy-to-test ColdFusion component.

To be honest, I'm not exactly sure what the use-case would be for implementing MySQL's Compress() and Uncompress() functions in ColdFusion (or Java). Perhaps it would be a cost and performance savings to both decrease the size of the data coming down over the wire (from the database) and to push the cost of inflating the binary data onto the application servers. Regardless, the idea definitely was fun to explore.

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/3152

Reader Comments

Ben Nadel Sep 30, 2016 at 8:18 AM

15,841 Comments

@All,

When I was researching this, I came across an article that said I should try to avoid ByteArrayOutputStream for performance reasons. Since I am using the ByteArrayOutputStream in this exploration, I wanted to do a quick follow-up exploration of how ByteBuffers work:

www.bennadel.com/blog/3155-experimenting-with-bytebuffer-in-coldfusion-for-binary-manipulation.htm

ByteBuffers are pre-allocated and are supposed to be much more performance. This demo could easily be re-worked to use ByteBuffer instead of ByteArrayOutputStream.

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.