SerializeJson() And "The Input And Output Encodings Are Not Same" Errors In ColdFusion
Over the weekend, I was hit by a very peculiar bug in ColdFusion. I was taking binary data, encoding it with the Base64 character set, and then serializing it as part of a larger JSON (JavaScript Object Notation) payload. The serialization went fine (seemingly); but, when I went to deserialize and decode the binary value, I got the following ColdFusion error:
The input and output encodings are not same. Ensure that the input was encoded with the same encoding as the one you are using to decode it with.
Though not terribly insightful, this error means that I'm trying to decode base64 data that contains characters outside of the base64 character set.
After "diff'ing" a bunch of before and after values, I finally narrowed it down to a single character string that was being incorrectly replaced during the serialization process. For some reason, the character string, "u+" was being replaced with "\u" during serialization. It doesn't happen for all binary values; but, I was able to find the one file in my task that was causing problems:
<cfscript>
originalBinary = fileReadBinary( expandPath( "./data.png" ) );
// Convert the byte array to a string.
encodedBinary = binaryEncode( originalBinary, "base64" );
// Encode the string for storage as JSON (JavaScript Object Notation).
// --
// NOTE: There is no implicit value in converting base64-encoded values to JSON in
// this particular demo. But, the point of the post is to look at serializeJson().
// So, just imagine that this base64-encoded value is part of a larger data
// structure that needs to be serialized.
serializedBinary = serializeJson( encodedBinary );
// ---
// Deserialize the data back into the "original" base64-encoded value.
deserializedBinary = deserializeJson( serializedBinary );
// Try to convert it back to a byte array.
binaryCopy = binaryDecode( deserializedBinary, "base64" );
</cfscript>
In this sample, all I'm doing is taking the binary value, encoding it, serializing it, deserializing it, and then attempting to decode it. Of course, since the file is known to be problematic, I get the error mentioned above:
The input and output encodings are not same.
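To make the failure mode concrete, here is a minimal sketch, written in Python rather than ColdFusion purely so that it can be run anywhere. It assumes the corruption described above: a JSON string in which "u+1234" has become "\u1234". A JSON parser reads that as a Unicode escape, so the deserialized value ends up containing a character that is not part of the base64 alphabet. The helper name invalid_base64_chars and the sample value are made up for illustration.

```python
import json
import string

# The 64 base64 symbols plus the "=" padding character.
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def invalid_base64_chars(text):
    """Return (index, character) pairs for characters outside the base64 alphabet."""
    return [(i, ch) for i, ch in enumerate(text) if ch not in BASE64_ALPHABET]


# Hypothetical sample: the base64 fragment "abu+1234cd" after "u+" was turned into "\u".
corrupted_json = r'"ab\u1234cd"'
deserialized = json.loads(corrupted_json)

# The intact fragment has nothing to report.
print(invalid_base64_chars("abu+1234cd"))

# The corrupted fragment contains one character that base64 cannot decode.
# ascii() keeps the output printable on any terminal.
print(ascii(invalid_base64_chars(deserialized)))
```

That single stray character is enough for a base64 decoder to reject the whole value.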
While this became symptomatic when storing binary data as base64, the problem is not limited to base64 data; the problem occurs with any string that contains "u+" (followed by at least 4 digits) and needs to be serialized using JSON. To demonstrate, I'll encode a simple string value:
<cfscript>
// Create an input string that has two different uses of the "u" character.
input = "Hello \u1111 u+2222 world";
// Output the original value.
writeOutput( input & "<br />" );
// Output the serialized value produced by serializeJson().
// --
// CAUTION: The "u+" string will be accidentally replaced with "\u".
writeOutput( serializeJson( input ) );
</cfscript>
When I run this code, I get the following page output:
Hello \u1111 u+2222 world
"Hello \\u1111 \u2222 world"
As you can see, the "u+2222" value is incorrectly replaced with "\u2222".
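For comparison, here is what a serializer that follows the JSON specification does with the same input. This is a small Python sketch (not ColdFusion) that uses the standard json module: the existing backslash gets escaped, and "u+2222" passes through untouched because neither "u" nor "+" needs escaping in a JSON string.

```python
import json

# The same literal text as the ColdFusion demo: one backslash-u and one "u+".
original = r"Hello \u1111 u+2222 world"

serialized = json.dumps(original)
print(serialized)  # prints "Hello \\u1111 u+2222 world" (including the quotes)

# A correct round trip returns exactly what went in.
assert json.loads(serialized) == original
```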
While I had never run into this problem before, apparently it's a known bug in the Adobe bug base. I could have just left it there; but, today is the 8th annual Regular Expression Day. So, why not use the blindingly awesome and unrelentingly magical power of regular expressions to solve this problem!
Ultimately, the fix is to manually [programmatically] replace the incorrectly-encoded value, "\u" with the original input value, "u+". However, we don't want to accidentally replace valid occurrences of "\u". If you look at the demo above, you'll see that the serialized output has two occurrences of "\u", but only one of them is problematic. The valid ones are the ones that include escaped slashes. So, when we go to execute our replacement, we can define a pattern that will only match "\u" if it is preceded by an unrelated value:
<cfscript>
// Create an input string that has several different uses of the "u" character.
// Several of them will be *incorrectly* replaced by the serializeJson() function.
input = "\\u+1234 hello \u1111 , \u+2222 , u+2222 world";
writeOutput( input & "<br />" );
serializedInput = serializeJson( input );
writeOutput( serializedInput & "<br />" );
// ------------------------------------------------------------------------------- //
// ------------------------------------------------------------------------------- //
// When the input is serialized, any existing "\" is escaped. Therefore, when we fix
// the serialization bug, we can use a Regular Expression to make sure that we only
// mutate the portions of the serialized value that need to be fixed and don't
// accidentally get too greedy (pun intended) with what we replace. To help explain
// the regular expression, I'm going to write it to an output buffer in parts.
savecontent variable = "pattern" {
// The verbose flag allows me to include non-significant whitespace in my pattern.
// Y'all know how much I love my whitespace.
writeOutput( "(?x)" );
// If the "\u" value is preceded by a "\", we have to make sure that the given
// slash is part of its own escaped block. As such, the "\u" either has to be
// preceded by no slashes; or, it has to be preceded by an even number of slashes
// that are all escaped.
// --
// ^ - Start of string.
// [^\\] - Any non-slash character.
// (?:\\\\)+ - Non-capturing group, matches an even-number of back-slashes.
// --
// NOTE: We have to escape the "\" as "\\" in the regular expression.
writeOutput( "( ^ | [^\\] | (?: \\ \\ )+ )" );
// This is the "broken" value that we need to replace. We'll be replacing the
// "\u" pattern with the intended value, "u+".
// --
// NOTE: We have to escape the "\" as "\\" in the regular expression.
writeOutput( "\\u" );
}
// Clean up the serialized value using the underlying Java regular expression methods
// on the String instance. These are much much faster (and more flexible) than the
// reReplace() methods provided by ColdFusion.
// --
// NOTE: The "$1" in the replacement string indicates the first captured group in the
// regular expression match, which is the group written by the second writeOutput()
// call in the above buffer (the first call only emits the verbose flag).
safeOutput = javaCast( "string", serializedInput ).replaceAll(
javaCast( "string", pattern ),
javaCast( "string", "$1u+" )
);
// ------------------------------------------------------------------------------- //
// ------------------------------------------------------------------------------- //
writeOutput( safeOutput & "<br />" );
// At this point, when we deserialize the value, it should match the original input
// value since we have manually corrected the serialization errors.
writeOutput( deserializeJson( safeOutput ) );
</cfscript>
In the "verbose" pattern above, I'm going to replace any instance of "\u" that is either:
- At the start of the string.
- Preceded by a non-slash character.
- Preceded by an even number of slash characters (to make sure that one of the preceding slashes isn't escaping the "\" in "\u").
When we run the above code, we get the following page output:
\\u+1234 hello \u1111 , \u+2222 , u+2222 world
"\\\\\u1234 hello \\u1111 , \\\u2222 , \u2222 world"
"\\\\u+1234 hello \\u1111 , \\u+2222 , u+2222 world"
\\u+1234 hello \u1111 , \u+2222 , u+2222 world
As you can see, the first and last lines match, indicating that the value successfully made it through the entire JSON serialization lifecycle intact.
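The same replacement can be sketched outside of ColdFusion. The Python version below is an illustration only, not the code I actually ran: it starts from the literal (already corrupted) serializeJson() output shown above and applies an equivalent verbose pattern. Python's re module is not identical to Java's regular expression engine, but this pattern only uses constructs that the two share.

```python
import json
import re

# The literal (already corrupted) serializeJson() output from the demo above.
serialized = r'"\\\\\u1234 hello \\u1111 , \\\u2222 , \u2222 world"'

# Same idea as the ColdFusion pattern: a "\u" is treated as broken only when it sits
# at the start, follows a non-backslash, or follows a run of escaped backslash pairs.
pattern = re.compile(
    r"""
    ( ^ | [^\\] | (?: \\ \\ )+ )   # group 1: the prefix that must be kept
    \\u                            # the broken "\u" to turn back into "u+"
    """,
    re.VERBOSE,
)

# "\g<1>" re-inserts the captured prefix, like "$1" in the Java replacement string.
repaired = pattern.sub(r"\g<1>u+", serialized)

print(repaired)
print(json.loads(repaired))
```

The first printed line should match the third line of the page output above, and the second printed line should match the original input.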
I ran into this bug in ColdFusion 10. And, the bug in the Adobe bug base is still flagged as "to be fixed" in ColdFusion 11. So, hopefully, if you run into this problem, you can leverage the awesomeness of regular expressions to help save the day! I'm going to try to build this feature into my ColdFusion JsonSerializer.cfc component project.
UPDATE: June 2, 2015 -- It's Not Quite That Simple
Ok, apparently this ColdFusion bug is a little more insidious than I had first realized. After I had put this "fix" in place, ColdFusion stopped throwing errors; but, I started to see some small data corruption. After looking at the deserialized Base64 values in BeyondCompare for Mac, I noticed that some of the "u" characters were mismatched. What I then realized was that serializeJson() was creating a single output from two different inputs:
- U+1234 to \u1234
- u+1234 to \u1234
Notice that both inputs, lowercase "u" and uppercase "U", are converted to the same serialized value, "\u". This means that after serialization, there is no way for us to know the original case of the input. And, in turn, it means that we cannot use a simple find-replace to "fix" the serialization bug after the fact.
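The collision can be illustrated with a small Python sketch. The function buggy_escape is a hypothetical model of the behavior described in this post; it is not ColdFusion's actual implementation, and treating the four trailing characters as hex digits is an assumption. The naive_repair function is a deliberately simplified find-replace that always guesses a lowercase "u". The last part shows why the guess matters for base64 data: "U" and "u" are different base64 symbols, so a wrong guess still decodes, just to different bytes.

```python
import base64
import re


def buggy_escape(text):
    """Hypothetical model of the bug: 'u+XXXX' or 'U+XXXX' becomes '\\uXXXX'."""
    return re.sub(r"[uU]\+(?=[0-9A-Fa-f]{4})", lambda match: "\\u", text)


def naive_repair(text):
    """Simplified find-replace fix, which has to guess a lowercase 'u'."""
    return text.replace("\\u", "u+")


lower, upper = "u+1234", "U+1234"

# Both inputs collapse to the same serialized text.
print(buggy_escape(lower) == buggy_escape(upper))

# Repairing the uppercase input silently changes its case.
print(naive_repair(buggy_escape(upper)))

# In base64, "U" and "u" are different symbols, so the decoded bytes differ.
print(base64.b64decode("U+12").hex(), base64.b64decode("u+12").hex())
```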
If you are serializing a simple string, you could inject a marker before serialization and then extract it after serialization. But, once you need to serialize a structure of any size or complexity, this approach quickly becomes untenable. Really, I think the only viable solution, short of Adobe actually fixing this bug, is to use something like JsonSerializer.cfc, which can "own" the entire serialization mechanism. I haven't put a fix in for it yet; but, I hope to do that soon.
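For the simple-string case, the marker idea might look like the following Python sketch. Everything in it is an assumption made for illustration: the sentinel value MARKER, the function name, and the use of Python's json.dumps as a stand-in for the serializer (Python's serializer does not have this bug, so the sketch only shows the mechanics). The "+" that follows a "u" or "U" is hidden behind the marker before serialization and restored in the serialized text afterward, so the serializer never sees a "u+" sequence.

```python
import json

# Hypothetical sentinel: it must never occur in real input, and it must consist of
# characters that a JSON serializer leaves unescaped.
MARKER = "__PLUS_7f3a__"


def serialize_with_marker(value):
    """Serialize one string while hiding every 'u+' / 'U+' from the serializer."""
    if MARKER in value:
        raise ValueError("input already contains the marker")
    protected = value.replace("u+", "u" + MARKER).replace("U+", "U" + MARKER)
    serialized = json.dumps(protected)
    # Put the "+" back directly in the serialized text.
    return serialized.replace(MARKER, "+")


payload = serialize_with_marker("U+1234 and u+1234")
print(payload)
print(json.loads(payload))
```

This works for one string at a time. In a nested structure, every string value (and every key) would need the same treatment before serialization, which is where the approach stops being practical.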
Reader Comments
Also, if you are using the older toString(), toBase64(), toBinary() functions, this may manifest with a slightly different error message:
> The parameter 1 of function ToBinary, which is now XYZ... must be a base-64 encoded string.
Ok, so this problem turns out to be even more devious than I originally understood. I'm still noodling on it; but, the quick gist is that two different inputs can produce the same output:
u+1234 --> \u1234
U+1234 --> \u1234
The first input is lowercase "u", the second input is uppercase "U". But both result in the same janky JSON output. The problem with this is that I can't know which value was used in the input. This means that if I just assume it was lowercase "u", some percentage of the conversions will be wrong.
The particularly annoying part of this is that the conversion will likely "work" in that the data can be deserialized and used. But, the data will be ever so slightly wrong. Which means that the corruption may go unnoticed.
Also, it looks like this is a relatively new bug - not a long-standing bug. On a version of ColdFusion 10.0.13, I cannot reproduce the problem. I think I read that it was introduced with patch 14.
As a follow-up, I wanted to see what it would look like to manually execute JSON serialization on an input string:
www.bennadel.com/blog/2843-manually-serializing-a-string-using-json-encoding-in-coldfusion.htm
This is probably the kind of approach that I'll have to build into JSONSerializer.cfc.
https://bugbase.adobe.com/index.cfm?event=bug&id=3941059
@Jonas,
Ha ha, looks like there's more than one ticket around it - here is the one I referenced in my blog post:
https://bugbase.adobe.com/index.cfm?event=bug&id=3837347
Yes, I've commented in that one as well. I didn't find it until I raised my bug report.
I think it's scary that:
#1 - Adobe can completely misread a simple specification like the JSON specification.
#2 - Adobe still seems to think that they have implemented the JSON specification correctly.
#3 - Adobe doesn't see a problem with getting something different back after a serialization->deserialization round.
#4 - Still haven't fixed the damn thing after several months.
I'm a bit frustrated with Adobe.
@Jonas,
Above and beyond all of that, I think the absolute strangest part of all of this is that it seems to be a recently-emergent bug. Meaning, when I run u+1234 through ColdFusion 10.0.13, it works fine. From what I think I saw in the bug report(s), this problem started in .14. Which means that, at some point, someone was like, "Hey, this is broken - let's **fix** it." So, what I want to know is, where's the bug report that says it was broken previously?
That said, I think serializeJson() has had a lot of bugs in the past; so, it's possible this change just got folded into other attempts to fix stuff.
To be completely blunt, the world basically runs on JSON these days - it is *the* preferred data encoding format for inter-system communication (becoming more popular every day). The fact that ColdFusion doesn't get this right is quite possibly the most frustrating thing about ColdFusion. If I had to pick a straw that broke the camel's back, it would be the lack of JSON correctness.
They broke it when they tried, and failed, to fix this bug: https://bugbase.adobe.com/index.cfm?event=bug&id=3561029
Every single patch seems to contain a couple of "fixes" for SerializeJSON.
@Jonas,
So sad :(