Extracting Illegal Tag Names From AntiSamy Error Messages In ColdFusion
Yesterday, I observed a number of errors on my blog in which a reader was attempting to post a comment with unsupported HTML content. Only, the error was not intentional—the reader was making reference to the ColdFusion tags, <cfquery>
and <cfqueryparam>
; and, my AntiSamy HTML sanitization workflow was interpreting these tokens as "HTML tags". This is an easy markdown error to fix—you wrap the tags in back-ticks to denote them as code. Only, my error reporting offered up zero insight into the underlying problem. As such, I went in and updated my sanitization workflow to extract the illegal tags from the OWASP AntiSamy error messages and report them to the reader.
Unfortunately, AntiSamy doesn't exactly surface the list of illegal tags—it just gives you a list of error messages in the form of:
"The cfquery tag has been filtered for security reasons. The contents of the tag will remain in place."
"The cfqueryparam tag has been filtered for security reasons. The contents of the tag will remain in place."
In order to get the list of tags, I have to parse the error message strings. This is always a sub-optimal approach because it's tightly coupled to the structure of the error message. And, if the error message structure were ever to change in a subsequent release of the AntiSamy project, my parsing would start failing; and, it would likely do so in a silent way that doesn't throw an error.
That said, I don't have many options. As such, the sub-optimal approach is now my approach of choice. The good news is, the error message structure appears to be consistent and easy to work with. All of the illegal tags are reported using the text pattern:
"The (tag name) tag ..."
We can use a regular expression (RegEx) pattern to locate this string of tokens, abstracting the tag name as a sequence of non-space characters (\S+
):
\bthe \S+ tag\b"
And, once we have this set of tokens extracted, we can think of it as a space-delimited list with our tag name as the second element in the list.
As I was playing around with this tag name extraction, I ran into one quirk: tags with dashes (-
) in them, such as <foo-bar>
, get reported with the dash represented as an HTML entity: -
. This will likely mean nothing to the reader (who's posting the comment). As such, I'm going to use ColdFusion's canonicalize()
function to decode all of the text within the tag name.
Here's a truncated copy of my HTML Sanitization ColdFusion component. As part of the .scan()
method, I'm taking the array of error messages and extracting the list of illegal tag names via getIllegalTagsFromErrors()
:
component hint="HtmlSanitizationService" {
// ... truncated ...
/**
* I scan the given untrusted HTML and return a sanitized version along with the
* validation errors that were encountered during the scan.
*/
public struct function scan( required string unsafeHtml ) {
var scanResult = antisamy.scan( javaCast( "string", unsafeHtml ), policy );
var html = scanResult.getCleanHTML();
var errors = scanResult.getErrorMessages();
var illegalTags = getIllegalTagsFromErrors( errors );
return {
html: html,
errors: errors,
illegalTags: illegalTags
};
}
// ---
// PRIVATE METHODS.
// ---
/**
* I extract the list of illegal tags from the given error messages.
*
* Caution: This relies on pattern matching in the error text. This is generally not a
* great practice since the error message text may change in future updates and this
* method will stop working (in subtle ways). But, it's better than nothing.
*/
private array function getIllegalTagsFromErrors( required array errorMessages ) {
var illegalTags = errorMessages
.toList()
.lcase()
.reMatch( "\bthe \S+ tag\b" )
.map(
( match ) => {
// The text that we matched can be seen as a space-delimited list. And
// our illegal tag name is the second element in that list.
var rawTagName = match.listGetAt( 2, " " );
// If the tag name contained a dash, AntiSamy appears to escape /
// encode it as an HTML entity. As such, let's canonicalize the tag
// name so that it becomes easier to read for the user.
return canonicalize( rawTagName, false, false, false );
}
)
;
return illegalTags;
}
}
Now, in my blog's error handling logic, I can take that collection of illegalTags
and roll it into the error message that I present to the user:
I'm hoping this reduces friction for my dear readers. And, I apologize to the person who was trying to post the comment yesterday—it seems that you never figured out what the error was and didn't post your comment (a result of my shoddy user experience).
Want to use code from this post? Check out the license.
Reader Comments
Post A Comment — ❤️ I'd Love To Hear From You! ❤️
Post a Comment →