Ben Nadel at cf.Objective() 2010 (Minneapolis, MN) with: Emily Meyer

Extracting Illegal Tag Names From AntiSamy Error Messages In ColdFusion

Yesterday, I observed a number of errors on my blog in which a reader was attempting to post a comment with unsupported HTML content. Only, the HTML was not intentional: the reader was making reference to the ColdFusion tags, <cfquery> and <cfqueryparam>; and, my AntiSamy HTML sanitization workflow was interpreting these tokens as "HTML tags". This is an easy markdown error to fix: you wrap the tags in back-ticks to denote them as code. Only, my error reporting offered up zero insight into the underlying problem. As such, I went in and updated my sanitization workflow to extract the illegal tags from the OWASP AntiSamy error messages and report them to the reader.

Unfortunately, AntiSamy doesn't exactly surface the list of illegal tags—it just gives you a list of error messages in the form of:

  • "The cfquery tag has been filtered for security reasons. The contents of the tag will remain in place."

  • "The cfqueryparam tag has been filtered for security reasons. The contents of the tag will remain in place."

In order to get the list of tags, I have to parse the error message strings. This is always a sub-optimal approach because it's tightly coupled to the structure of the error message. And, if the error message structure were ever to change in a subsequent release of the AntiSamy project, my parsing would start failing; and, it would likely do so in a silent way that doesn't throw an error.

That said, I don't have many options. As such, the sub-optimal approach is now my approach of choice. The good news is, the error message structure appears to be consistent and easy to work with. All of the illegal tags are reported using the text pattern:

"The (tag name) tag ..."

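We can use a regular expression (RegEx) pattern to locate this string of tokens, abstracting the tag name as a sequence of non-space characters (\S+). The pattern uses a lowercase "the" because the error messages get lowercased before the pattern is applied:

\bthe \S+ tag\b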

And, once we have this set of tokens extracted, we can think of it as a space-delimited list with our tag name as the second element in the list.
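
To see the matching and the list extraction in isolation, here's a minimal sketch that runs the pattern against the two sample messages from above. The variable names are just for illustration; only the pattern and the "second list item" idea come from the post.

```cfml
<cfscript>

	// Sample messages copied from the two examples earlier in the post.
	errorMessages = [
		"The cfquery tag has been filtered for security reasons. The contents of the tag will remain in place.",
		"The cfqueryparam tag has been filtered for security reasons. The contents of the tag will remain in place."
	];

	// Join and lowercase the messages so the lowercase pattern can match "The".
	haystack = lcase( arrayToList( errorMessages ) );

	// Each match looks like "the cfquery tag".
	matches = reMatch( "\bthe \S+ tag\b", haystack );

	// Treat each match as a space-delimited list; the tag name is item 2.
	tagNames = [];

	for ( match in matches ) {

		arrayAppend( tagNames, listGetAt( match, 2, " " ) );

	}

	// Dumps an array containing "cfquery" and "cfqueryparam".
	writeDump( tagNames );

</cfscript>
```

Note that the trailing "the tag will remain in place" portion of each message does not produce a match, since the pattern requires exactly one non-space token between "the" and "tag".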

As I was playing around with this tag name extraction, I ran into one quirk: tags with dashes (-) in them, such as <foo-bar>, get reported with the dash represented as an HTML entity: &#45;. This will likely mean nothing to the reader (who's posting the comment). As such, I'm going to use ColdFusion's canonicalize() function to decode all of the text within the tag name.
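
As a small sketch of that decoding step, assuming the raw tag name arrives as "foo&#45;bar" (a made-up value based on the <foo-bar> example), the call looks something like this. The argument order (input, restrictMultiple, restrictMixed, throwOnError) is the one documented for Adobe ColdFusion.

```cfml
<cfscript>

	// The "#" is doubled only because it must be escaped inside a CFML string
	// literal. The actual value of the variable is: foo&#45;bar
	rawTagName = "foo&##45;bar";

	// Decode the entity. The three false flags relax the multiple / mixed
	// encoding restrictions and turn off throwing on problematic input.
	readableTagName = canonicalize( rawTagName, false, false, false );

	// Outputs: foo-bar
	writeOutput( readableTagName );

</cfscript>
```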

Here's a truncated copy of my HTML Sanitization ColdFusion component. As part of the .scan() method, I'm taking the array of error messages and extracting the list of illegal tag names via getIllegalTagsFromErrors():

component hint="HtmlSanitizationService" {

	// ... truncated ...

	/**
	* I scan the given untrusted HTML and return a sanitized version along with the
	* validation errors that were encountered during the scan.
	*/
	public struct function scan( required string unsafeHtml ) {

		var scanResult = antisamy.scan( javaCast( "string", unsafeHtml ), policy );

		var html = scanResult.getCleanHTML();
		var errors = scanResult.getErrorMessages();
		var illegalTags = getIllegalTagsFromErrors( errors );

		return {
			html: html,
			errors: errors,
			illegalTags: illegalTags
		};

	}

	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I extract the list of illegal tags from the given error messages.
	* 
	* Caution: This relies on pattern matching in the error text. This is generally not a
	* great practice since the error message text may change in future updates and this
	* method will stop working (in subtle ways). But, it's better than nothing.
	*/
	private array function getIllegalTagsFromErrors( required array errorMessages ) {

		var illegalTags = errorMessages
			.toList()
			.lcase()
			.reMatch( "\bthe \S+ tag\b" )
			.map(
				( match ) => {

					// The text that we matched can be seen as a space-delimited list. And
					// our illegal tag name is the second element in that list.
					var rawTagName = match.listGetAt( 2, " " );

					// If the tag name contained a dash, AntiSamy appears to escape /
					// encode it as an HTML entity. As such, let's canonicalize the tag
					// name so that it becomes easier to read for the user.
					return canonicalize( rawTagName, false, false, false );

				}
			)
		;

		return illegalTags;

	}

}
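
The caution in that comment, that the parsing could stop working quietly, can be softened a little with a sanity check. This is only a sketch: mayHaveParsingDrift() is a hypothetical helper, not part of the component above. It also assumes that "errors reported but no tags extracted" is worth a look; AntiSamy may well report errors that never mention a tag, so treat a positive result as a hint to log, not as proof that the message format changed.

```cfml
<cfscript>

	// Hypothetical helper (not part of the original component): flags the case
	// where AntiSamy reported errors but the pattern extracted no tag names.
	function mayHaveParsingDrift(
		required array errorMessages,
		required array illegalTags
		) {

		return ( arrayLen( errorMessages ) > 0 && arrayLen( illegalTags ) == 0 );

	}

	// Errors were reported and a tag name was extracted: nothing suspicious.
	expectedCase = mayHaveParsingDrift(
		[ "The cfquery tag has been filtered for security reasons." ],
		[ "cfquery" ]
	);

	// Errors were reported in a made-up, unrecognized format and nothing was
	// extracted: this is the situation that might be worth logging.
	suspiciousCase = mayHaveParsingDrift(
		[ "Filtered element: cfquery" ],
		[]
	);

	writeDump( expectedCase );
	writeDump( suspiciousCase );

</cfscript>
```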

Now, in my blog's error handling logic, I can take that collection of illegalTags and roll it into the error message that I present to the user:

Screenshot of user leaving a comment with unescaped tags; and an error message that properly reports said tag names.
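
The blog's actual error handling isn't shown in this post, so the following is only a sketch of how the illegalTags array might be folded into a user-facing message. The function name, buildIllegalTagsMessage(), and the message wording are assumptions for illustration.

```cfml
<cfscript>

	// Hypothetical helper; neither the name nor the wording comes from the
	// original component.
	function buildIllegalTagsMessage( required array illegalTags ) {

		// Fall back to a generic message if no tag names could be extracted.
		if ( ! arrayLen( illegalTags ) ) {

			return "Your comment contains unsupported HTML.";

		}

		// Wrap each name in angle brackets so it reads like what the user typed.
		var tagList = illegalTags
			.map( ( tagName ) => "<" & tagName & ">" )
			.toList( ", " )
		;

		return "Your comment contains unsupported HTML tags: #tagList#. If you meant to reference these tags as code, wrap them in back-ticks.";

	}

	userMessage = buildIllegalTagsMessage( [ "cfquery", "cfqueryparam" ] );

	// Encode for HTML so the angle brackets render as literal text.
	writeOutput( encodeForHtml( userMessage ) );

</cfscript>
```

Encoding the message on output matters here because the tag names originate from untrusted user input.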

I'm hoping this reduces friction for my dear readers. And, I apologize to the person who was trying to post the comment yesterday—it seems that you never figured out what the error was and didn't post your comment (a result of my shoddy user experience).

Ben Nadel