Ben Nadel at the jQuery Conference 2010 (Boston, MA) with: Jonathan Sharp

Testing Which ASCII Characters Break JSON (JavaScript Object Notation) Parsing

By
Published in , Comments (12)

The title of this blog post is a bit of a misnomer, so take it with a grain of salt. Basically, I have found that in my AJAX-driven applications, certain characters, embedded within a JSON (JavaScript Object Notation) payload, break the parser and throw a syntax error. The character limitations are not the same everywhere: AJAX-based JSON and inline JSON have different limitations. This is a test to discover, through brute force, which character codes are problematic.

AJAX-Driven JSON Testing

In this test, I am looking at which characters cause problems for the JSON parser when delivered in the HTTP response body of an AJAX request. This test is rather straightforward - I make a request for a given character, return said character in the serialized response, and then check to see if the response can be successfully parsed. If the response cannot be parsed, I know that the given ASCII character cannot be delivered in an AJAX-driven JSON payload.

So as not to overload the browser, I am making each AJAX request in serial using jQuery's Deferred functionality. Notice that inside the recursive function, I don't proceed to the next test until the AJAX promise has been either resolved or rejected. And, of course, if rejected, I am logging the problematic character:

<!doctype html>
<html>
<head>
	<meta charset="utf-8" />

	<title>
		Invalid JSON (JavaScript Object Notation) Characters In JSON.parse()
	</title>
</head>
<body>

	<h1>
		Invalid JSON (JavaScript Object Notation) Characters In JSON.parse()
	</h1>

	<!-- To log the failed characters. -->
	<textarea rows="20" cols="10" class="errorLog"></textarea>

	<script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.js"></script>
	<script type="text/javascript">

		var errorLog = $( "textarea.errorLog" ).val( "" );


		// Populate the ASCII values that we want to test. When it
		// comes to AJAX-based JSON, it's really only control
		// characters that matter (<32). However, just to cross our
		// tees, I'll go high to check high-ascii control characters
		// like the "paragraph separator".
		var asciiValues = [];

		for ( var i = 0 ; i < 10000 ; i++ ) {

			asciiValues.push( i );

		}


		// This function will recursively test each value, one at a
		// time so we don't overwhelm the browser with thousands of
		// AJAX requests. Notice that it doesn't call itself until the
		// promise has been resolved (or rejected).
		(function testNextValue() {

			if ( ! asciiValues.length ) {

				return;

			}

			// Get the next value off the array (pop from front).
			var charCode = asciiValues.shift();

			var promise = $.ajax({
				type: "get",
				url: "./ajax.cfm",
				data: {
					charCode: charCode
				},
				dataType: "json"
			});

			// When the AJAX request finishes, log any FAILURES.
			promise.then(
				testNextValue,
				function() {

					errorLog[ 0 ].value += ( charCode + "\n" );

					testNextValue();

				}
			);

		})(); // NOTE: Self-invocation to kick-it-off!

	</script>

</body>
</html>
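
The same one-request-at-a-time flow can also be sketched with native promises, outside of jQuery. This is only an illustration of the control flow, not the test that produced the results in this post: `fakeRequest` is a hypothetical stand-in for the `$.ajax()` call, and `pretendBadCodes` is a made-up set of failing values so that the sketch can run without a server.

```javascript
// Hypothetical stand-in for the $.ajax() call: resolves for most char codes
// and rejects for a made-up set, so the flow can run without a server.
var pretendBadCodes = [ 1, 11, 27 ];

function fakeRequest( charCode ) {

	return new Promise( function( resolve, reject ) {

		setTimeout( function() {

			if ( pretendBadCodes.indexOf( charCode ) >= 0 ) {
				reject( new Error( "Parse error for " + charCode ) );
			} else {
				resolve( { charCode: charCode } );
			}

		}, 0 );

	});

}

// Test each value strictly one at a time: the next request does not start
// until the previous promise has been resolved or rejected.
function testSerially( charCodes, request ) {

	var failures = [];

	var chain = charCodes.reduce(
		function( previous, charCode ) {

			return previous
				.then( function() {
					return request( charCode );
				})
				.catch( function() {
					failures.push( charCode );
				});

		},
		Promise.resolve()
	);

	return chain.then( function() {
		return failures;
	});

}

var serialRun = testSerially( [ 0, 1, 2, 11, 27, 32 ], fakeRequest );

serialRun.then( function( failures ) {
	console.log( "Failed char codes:", failures.join( ", " ) );
});
```

Chaining through `reduce()` keeps the "wait, then move on" rule in one place; a rejected request is recorded and the chain simply continues with the next value.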

And, here is the code that the AJAX-driven JSON test is calling:

<cfscript>

	param name = "url.charCode" type = "numeric";

	// Build an object with the target ASCII value encoded as a string
	// character in one of the key values.
	result = {
		"charCode" = url.charCode,
		"data" = "[#chr( url.charCode )#]"
	};

	response = serializeJson( result );

</cfscript>

<cfcontent
	type="application/x-json; charset=utf-8"
	variable="#charsetDecode( response, 'utf-8' )#"
	/>

When this test finished, I found the following ASCII control characters to be problematic:

  • 1 - Start of Heading
  • 2 - Start of Text
  • 3 - End of Text
  • 4 - End of Transmission
  • 5 - Enquiry
  • 6 - Acknowledge
  • 7 - Bell
  • 11 - Vertical Tab
  • 14 - Shift Out
  • 15 - Shift In
  • 16 - Data Link Escape
  • 17 - Device Control 1
  • 18 - Device Control 2
  • 19 - Device Control 3
  • 20 - Device Control 4
  • 21 - Negative Acknowledge
  • 22 - Synchronous Idle
  • 23 - End of Trans. Block
  • 24 - Cancel
  • 25 - End of Medium
  • 26 - Substitute
  • 27 - Escape
  • 28 - File Separator
  • 29 - Group Separator
  • 30 - Record Separator
  • 31 - Unit Separator
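
For comparison, the same brute-force idea can be run directly against `JSON.parse()` with no server involved. This is a sketch under a different assumption than the test above: each character is spliced into the JSON text raw, with no serializer in between, so the result reflects the parser alone. Under that assumption, every code below 32 is rejected, along with the double quote (34) and the backslash (92), which break the string literal itself. The control codes missing from the list above (such as 9, 10, and 13) were presumably escaped by serializeJson() before they reached the browser; that is an inference, not something this test verified.

```javascript
// Splice each character, unescaped, into a JSON string literal and record
// the char codes that JSON.parse() rejects.
function findRawParseFailures( limit ) {

	var failures = [];

	for ( var i = 0 ; i < limit ; i++ ) {

		var rawJson = ( "{\"charCode\":" + i + ",\"data\":\"[" + String.fromCharCode( i ) + "]\"}" );

		try {
			JSON.parse( rawJson );
		} catch ( error ) {
			failures.push( i );
		}

	}

	return failures;

}

var rawFailures = findRawParseFailures( 10000 );

console.log( rawFailures.join( ", " ) );
```

Note that 8232 and 8233 do not show up here: the JSON grammar allows them unescaped inside a string.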

Inline JSON Testing

This test is a little strange since I'm not really dealing with JSON so much as I am dealing with defining actual JavaScript objects. But, since this approach usually gets used in parallel with AJAX-driven delivery (i.e., in the same application), I am grouping it under the same investigation.

In this test, we are using server-side code to serialize ColdFusion objects as JSON; then, we are using that JSON to generate JavaScript. To clarify, let me show you the individual test, rather than the test harness, first:

<cfscript>

	param name = "url.charCode" type = "numeric";

	// Build an object with the target ASCII value encoded as a string
	// character in one of the key values.
	result = {
		"charCode" = url.charCode,
		"data" = "[#chr( url.charCode )#]"
	};

</cfscript>

<cfoutput>

	<!doctype html>
	<html>
	<head>
		<script type="text/javascript">

			var x = #serializeJson( result )#;

			// This will only be called IF the above JSON does not
			// break the parsing.
			window.top.logSuccess( #url.charCode# );

		</script>
	</head>
	<body>
		<!-- Intentionally left blank. -->
	</body>
	</html>

</cfoutput>

As you can see, the intent is the same as the previous test; however, instead of returning the JSON in an HTTP response, we are outputting it directly in the HTML.

Here is the code that loads the test using an IFrame. Just as before, we are running each test in serial so as not to lock up the browser (these tests take several minutes to run through 10,000 character codes).

<!doctype html>
<html>
<head>
	<meta charset="utf-8" />

	<title>
		Invalid Inline JSON (JavaScript Object Notation) Characters
	</title>
</head>
<body>

	<h1>
		Invalid Inline JSON (JavaScript Object Notation) Characters
	</h1>

	<!-- To log the failed characters. -->
	<textarea rows="20" cols="10" class="errorLog"></textarea>

	<script type="text/javascript" src="//cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.js"></script>
	<script type="text/javascript">

		var errorLog = $( "textarea.errorLog" ).val( "" );
		var body = $( "body" );


		// Populate the ASCII values that we want to test. When it
		// comes to INLINE JSON, the rules about which characters are
		// bad are less clear (to me). So, we'll test a huge range.
		var asciiValues = [];

		for ( var i = 0 ; i < 10000 ; i++ ) {

			asciiValues.push( i );

		}


		// As the IFrames load, their last call is to log the success
		// of the given charCode. This will only get called if the
		// inline JSON does not cause a parsing error.
		function logSuccess( charCode ) {

			// Remove the charCode from the error log.
			errorLog.val(
				errorLog.val()
					.replace( new RegExp( "\\b" + charCode + "\\b\\s*", "" ), "" )
			);

		}


		// This function will recursively test each value, one at a
		// time so we don't overwhelm the browser with thousands of
		// HTTP requests. Notice that it doesn't call itself until the
		// iframe has been loaded.
		(function testNextValue() {

			if ( ! asciiValues.length ) {

				return;

			}

			// Get the next value off the array (pop from front).
			var charCode = asciiValues.shift();

			// Assume that this charCode will fail and append it to
			// the error log. This way, the charCode will only be
			// flagged as OK if the inline JSON doesn't cause a
			// parsing error.
			errorLog[ 0 ].value += ( charCode + "\n" );

			var frame = $( "<iframe></iframe>" )
				.prop( "src", ( "./inline.cfm?charCode=" + charCode ) )
				.height( 1 )
				.width( 1 )
			;

			// When the IFrame has loaded, we can assume this single
			// test is complete.
			frame.on(
				"load",
				function( event ) {

					frame.remove();

					testNextValue();

				}
			);

			// Append to the body so the IFrame loads.
			body.append( frame );

		})(); // NOTE: Self-invocation to kick-it-off!

	</script>

</body>
</html>

When this test finished, I found the following characters to be problematic when delivered as inline JSON. Note that these are Unicode characters well outside the ASCII range, not ASCII control characters:

  • 8232 - Line Separator
  • 8233 - Paragraph Separator
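
These two values are the Unicode code points U+2028 and U+2029 (decimal 8232 and 8233). JSON permits them unescaped inside strings, but JavaScript source, prior to ES2019, treated them as line terminators, which ends a string literal early. One possible precaution, sketched below, is to replace them with their `\u` escape sequences before writing serialized JSON into a script tag. The `escapeForInlineScript` name is hypothetical, and `JSON.stringify()` stands in for the server-side serializer; this was not part of the tests above.

```javascript
// Replace U+2028 / U+2029 with their \u escape sequences so the serialized
// JSON is also a valid JavaScript literal in older engines.
function escapeForInlineScript( json ) {

	return json
		.replace( /\u2028/g, "\\u2028" )
		.replace( /\u2029/g, "\\u2029" )
	;

}

var payload = {
	charCode: 8232,
	data: ( "[" + String.fromCharCode( 8232 ) + String.fromCharCode( 8233 ) + "]" )
};

// JSON.stringify() stands in for the server-side serializer here.
var inlineSafe = escapeForInlineScript( JSON.stringify( payload ) );

console.log( inlineSafe );
```

The escaped text is still valid JSON, so `JSON.parse( inlineSafe ).data` yields the original string with both separators intact.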

Anyway, this was more for personal understanding of what precautions I have to take when delivering user-provided content in my AJAX-driven web applications. I did this because I've had users that will tell me such-and-such page is breaking for them; and, after further investigation, I've found the problem to be some ASCII character that is not "valid" inside the given usage context.
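
As one illustration of such a precaution, a serializer that escapes control characters sidesteps the AJAX failures entirely. The sketch below uses JavaScript's own `JSON.stringify()` as a stand-in (it says nothing about how ColdFusion's serializeJson() behaves) and checks that every char code in the tested range survives a serialize-then-parse round trip.

```javascript
// Serialize each character with JSON.stringify(), which escapes control
// characters, then confirm JSON.parse() returns the original string.
function findRoundTripFailures( limit ) {

	var failures = [];

	for ( var i = 0 ; i < limit ; i++ ) {

		var original = ( "[" + String.fromCharCode( i ) + "]" );
		var serialized = JSON.stringify( { charCode: i, data: original } );

		try {
			if ( JSON.parse( serialized ).data !== original ) {
				failures.push( i );
			}
		} catch ( error ) {
			failures.push( i );
		}

	}

	return failures;

}

var roundTripFailures = findRoundTripFailures( 10000 );

console.log( "Round-trip failures:", roundTripFailures.length );

// For example, the Vertical Tab (11) is emitted as an escape sequence.
console.log( JSON.stringify( String.fromCharCode( 11 ) ) );
```

With an escaping serializer on the sending side, the control characters arrive as escape sequences rather than raw bytes, which is the form the parser accepts.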


Reader Comments

15,902 Comments

@All,

I should mention that when the "inline" test breaks, it breaks with the following JavaScript error:

SyntaxError: JSON Parse error: Unterminated string

... just adding that for some Google love :)

15,902 Comments

Also, I guess I should note that this experiment clearly leans heavily on the ColdFusion implementation of serializeJson(). It's possible that other JSON serializers will handle the problematic characters more gracefully. I'll try to devise some tests that add some more explicit "escaping"... but, time to start work :(

15,902 Comments

@Zac,

I started to notice it on CF 9.0.1... but I ran all these tests on CF 10. But, the issue isn't so much in CF as it's a parsing error in JS. That said, it would be cool if serializeJson() would escape those characters. I will have to look at the spec - JSON. I can't remember if the control characters need to be escaped... or if they are simply not allowed.

15,902 Comments

@Senz,

True indeed. I tried to call this out at the top of the "Inline" section:

This test is a little strange since I'm not really dealing with JSON so much as I am dealing with defining actual JavaScript objects.

... but, these two concepts seem very much related and often executed, at least in my experience, in the same apps. That said, even from my tests, it becomes clear that the limitations are different.

15,902 Comments

@Pat,

When I first came across this issue, it was driving me NUTS!! The problem was, I was using reReplace() just as you were (and still am). But, at first, I tried to use the hexadecimal patterns to match the characters. According to this page, the line-separator:

http://www.fileformat.info/info/unicode/char/2028/index.htm

... has the decimal 8232, but the hex value 2028. But, using the Java Pattern class, I could never match it using "\u2028".

My understanding of the world is rather weak once you leave decimal format :) Anyway, it took me several days just to identify the characters in question! Such a pain.

That said, note that the 8232/8233 characters did not break JSON parsing - they only broke inline JavaScript code. Minor, but important difference in reference to your bug report.

2 Comments

Many thanks Ben. I seem to remember having a similar issue with those two characters with JavaScript code and TinyMCE container content! I think it drove us NUTS for a couple of days! I will update my comments on the ticket.

15,902 Comments

@Pat,

Good stuff! Trying to debug this stuff is tough; trying to debug this stuff when it's happening to one of your "users" is beyond aggravating!!

Ben Nadel