Using The OWASP AntiSamy 1.5.7 Project With ColdFusion 10 To Sanitize HTML Input And Help Prevent XSS Attacks
For the past few days, I've been working to enable Markdown for my blog comments. Of course, the second I enable Markdown, I allow my readers to submit a wider variety of content. In order to ensure that said content doesn't contain malicious or ill-advised code, I wanted to add a subsequent layer of validation and sanitization. I took a look at OWASP (Open Web Application Security Project) to see what they recommend, which is where I discovered the OWASP AntiSamy Project. AntiSamy allows untrusted HTML to be evaluated and sanitized using a custom security Policy. Unfortunately, loading AntiSamy 1.5.7 (the latest version at the time of this writing) into a ColdFusion application isn't effortless. It requires Mark Mandel's JavaLoader project and a few Class Loading shenanigans.
View this code in my AntiSamy 1.5.7 With ColdFusion 10 project on GitHub.
First off, I want to give a special shout-out to Matthew J. Clemente and his post about AntiSamy 1.5.3. He set me down the right path. I just needed to work out the differences between his use of 1.5.3 and my use of 1.5.7 - which, ironically, is only a patch-level bump under semver (Semantic Versioning) and yet clearly introduces breaking changes of some sort.
That said, the OWASP AntiSamy project uses an XML-based security Policy file to evaluate and sanitize untrusted, user-provided HTML. The XML policy file can be very relaxed; or, it can be very strict. The project contains a few sample XML files that are based on some real-world context. Of course, you can create your own Policy file with whichever rules you feel make sense.
For example, I created one for this demo that is very strict, and strips out all but the most basic text formatting tags:
<?xml version="1.0" encoding="UTF-8" ?>
<anti-samy-rules
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:noNamespaceSchemaLocation="antisamy.xsd">
	<directives>
		<directive name="embedStyleSheets" value="false" />
		<directive name="formatOutput" value="true" />
		<directive name="maxInputSize" value="100000" />
		<directive name="nofollowAnchors" value="true" />
		<directive name="omitDoctypeDeclaration" value="true" />
		<directive name="omitXmlDeclaration" value="true" />
		<directive name="onUnknownTag" value="remove" />
		<directive name="useXHTML" value="true" />
	</directives>
	<tag-rules>
		<tag name="a" action="validate">
			<attribute name="href" onInvalid="filterTag">
				<regexp-list>
					<regexp value="https?://[A-Za-z0-9]+[~a-zA-Z0-9-_\.@\#\$%&amp;;:,\?=/\+!\(\)]*" />
				</regexp-list>
			</attribute>
			<attribute name="rel">
				<literal-list>
					<literal value="nofollow" />
				</literal-list>
			</attribute>
		</tag>
		<tag name="b" action="validate" />
		<tag name="blockquote" action="validate" />
		<tag name="code" action="validate">
			<attribute name="class">
				<regexp-list>
					<regexp value="language-[a-zA-Z0-9]+" />
				</regexp-list>
			</attribute>
		</tag>
		<tag name="em" action="validate" />
		<tag name="i" action="validate" />
		<tag name="li" action="validate" />
		<tag name="ol" action="validate" />
		<tag name="p" action="validate" />
		<tag name="pre" action="validate">
			<attribute name="class">
				<regexp-list>
					<regexp value="language-[a-zA-Z0-9]+" />
				</regexp-list>
			</attribute>
		</tag>
		<tag name="strong" action="validate" />
		<tag name="ul" action="validate" />
	</tag-rules>
</anti-samy-rules>
By default, this AntiSamy policy will strip out all tags that are not explicitly listed in the tag rules (but will keep their content). This means that only tags like P, Strong, Blockquote, Code, and Pre will be allowed to pass through. But, even the tags that are allowed to pass through are still sanitized based on attribute validation rules. For example, the Code and Pre tags can have the attribute, "class"; but, only if it adheres to the given Regular Expression pattern.
NOTE: From what I can see, the only other valid value for "onUnknownTag" is "encode", which will escape the characters of unknown tags, rather than strip them out.
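To make the attribute rule a bit more concrete, here is a minimal Java sketch of the kind of check that the "class" rule describes. It uses only java.util.regex and is not AntiSamy's own code: the class and method names are made up for illustration, and it assumes that the pattern has to match the entire attribute value rather than a substring of it.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of AntiSamy: it only mirrors the "class" rule above.
class ClassAttributeRule {

    // Same expression as the regexp value on the Code and Pre tags in the demo policy.
    private static final Pattern LANGUAGE_CLASS = Pattern.compile("language-[a-zA-Z0-9]+");

    // Assumption: the whole attribute value must match, not just part of it.
    static boolean isAllowed(String attributeValue) {
        return LANGUAGE_CLASS.matcher(attributeValue).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("language-js"));            // true
        System.out.println(isAllowed("language-js evil-class")); // false
        System.out.println(isAllowed("highlight"));              // false
    }
}
```

Under that full-match assumption, a value such as "language-js evil-class" fails the rule because of the space and the extra token, so a single language class is all that can pass.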
Understanding the Policy XML file schema is complicated. I barely understand it. But, I did find a good description in WaveMaker's Learning Center that breaks it down fairly well.
Now, the primary issue with using AntiSamy 1.5.7 in ColdFusion 10 is that something in the internals of the library (or its dependencies) ends up using the wrong Class Loader and leads to the following error:
java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
Thankfully, the JavaLoader project has a special method to deal with this very problem: switchThreadContextClassLoader(). This is the same method that I used when loading LaunchDarkly's feature flag library into ColdFusion 10. This method executes an arbitrary function inside a context that prevents the executing code from reaching into the wrong Class Loader. It makes the code a bit harder to read; but, it gets the job done.
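For readers who think in Java, the general pattern behind a method like this can be sketched in a few lines. The following is an illustration of swapping a thread's context class loader around a callback, not JavaLoader's actual source; ContextClassLoaderSwitch and callWith are hypothetical names, and the empty URLClassLoader merely stands in for the loader that holds the AntiSamy JAR files.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.function.Supplier;

// Hypothetical helper illustrating the pattern; this is not JavaLoader's source code.
class ContextClassLoaderSwitch {

    // Runs the callback with the given loader as the thread's context class loader,
    // then restores the previous loader even if the callback throws.
    static <T> T callWith(ClassLoader loader, Supplier<T> callback) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return callback.get();
        } finally {
            thread.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        // An empty, parentless loader stands in for the loader that holds the AntiSamy JARs.
        ClassLoader isolated = new URLClassLoader(new URL[0], null);
        ClassLoader before = Thread.currentThread().getContextClassLoader();

        ClassLoader seenInside = callWith(isolated, () -> Thread.currentThread().getContextClassLoader());

        System.out.println(seenInside == isolated); // true
        System.out.println(Thread.currentThread().getContextClassLoader() == before); // true
    }
}
```

Java's XML factory lookups, such as DocumentBuilderFactory.newInstance(), consult the thread's context class loader when locating an implementation, which is presumably why forcing the loader for the duration of the call avoids the mismatched-class cast error.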
To see this in action, let's sanitize some HTML! First, we need to create our JavaLoader for AntiSamy 1.5.7, which I'm caching in the ColdFusion Application scope during application initialization:
component
	output = false
	hint = "I provide the application settings and event handlers."
	{

	// Define the application.
	this.name = hash( getCurrentTemplatePath() );
	this.applicationTimeout = createTimeSpan( 0, 0, 10, 0 );
	this.sessionManagement = false;

	// Setup the mappings
	this.directory = getDirectoryFromPath( getCurrentTemplatePath() );
	this.mappings[ "/" ] = this.directory;
	this.mappings[ "/antisamy" ] = ( this.directory & "vendor/antisamy-1.5.7/" );
	this.mappings[ "/javaloader" ] = ( this.directory & "vendor/javaloader-1.2/javaloader/" );
	this.mappings[ "/javaloaderfactory" ] = ( this.directory & "vendor/javaloaderfactory/" );

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I initialize the application.
	*
	* @output false
	*/
	public boolean function onApplicationStart() {

		// In order to prevent memory leaks, we're going to use the JavaLoaderFactory to
		// instantiate our JavaLoader. This will keep the instance cached in the Server
		// scope so that it doesn't have to continually re-create it as we test our
		// application configuration.
		application.javaLoaderFactory = new javaloaderfactory.JavaLoaderFactory();

		// Create a JavaLoader that can access the AntiSamy 1.5.7 JAR files.
		// --
		// CAUTION: The directory has MORE JAR FILES than are actually necessary to run
		// the demo. However, I just downloaded all the non-optional dependencies
		// according to the MAVEN resource pages. I don't actually know enough about Java
		// to know which libraries I can and cannot exclude from the JavaLoader. I have
		// commented-out the ones that were not supplied by the "JAR Download" website.
		application.antisamyJavaLoader = application.javaLoaderFactory.getJavaLoader([
			expandPath( "/antisamy/antisamy-1.5.7.jar" ),
			// expandPath( "/antisamy/avalon-framework-4.1.3.jar" ),
			// expandPath( "/antisamy/avalon-framework-4.1.5.jar" ),
			expandPath( "/antisamy/batik-constants-1.9.1.jar" ),
			expandPath( "/antisamy/batik-css-1.9.1.jar" ),
			expandPath( "/antisamy/batik-i18n-1.9.1.jar" ),
			expandPath( "/antisamy/batik-util-1.9.1.jar" ),
			expandPath( "/antisamy/commons-codec-1.6.jar" ),
			expandPath( "/antisamy/commons-io-1.3.1.jar" ),
			// expandPath( "/antisamy/commons-logging-1.0.4.jar" ),
			expandPath( "/antisamy/commons-logging-1.1.3.jar" ),
			expandPath( "/antisamy/httpclient-4.3.6.jar" ),
			expandPath( "/antisamy/httpcore-4.3.3.jar" ),
			// expandPath( "/antisamy/log4j-1.2.17.jar" ),
			// expandPath( "/antisamy/logkit-1.0.1.jar" ),
			expandPath( "/antisamy/nekohtml-1.9.22.jar" ),
			expandPath( "/antisamy/xercesImpl-2.11.0.jar" ),
			expandPath( "/antisamy/xml-apis-1.4.01.jar" ),
			expandPath( "/antisamy/xml-apis-ext-1.3.04.jar" ),
			// expandPath( "/antisamy/xml-resolver-1.2.jar" ),
			expandPath( "/antisamy/xmlgraphics-commons-2.2.jar" )
		]);

		// Indicate that the application has been initialized successfully.
		return( true );

	}

}
To be honest, I don't really know that much about Java. I love using random Java utilities, like the Pattern / Matcher classes; but, my understanding of how Java applications execute is very shallow. For example, I don't understand why not all of the Java JAR files are necessary to run this demo. To get started, I went to the Maven project page for AntiSamy 1.5.7 and just manually downloaded all of the non-optional dependencies. But, if I compare my list of files to the one in the ZIP file provided by the JAR Download site, they are different. As such, I commented-out the ones that were not present in the JAR Download ZIP.
CAUTION: I don't know if JAR Download is a legitimate site. I trust Maven, so I'll happily grab files from their site. However, I am not sure if JAR Download is a trusted resource - use those ZIP files with caution. I only used it for a comparison of what files were made available.
Once we have our AntiSamy JavaLoader instance cached, we can start looking at user-provided HTML. Here's where we have to jump through some obscure Class Loading hoops. In the following code, notice that I have to load my XML Policy file using the switchThreadContextClassLoader() method:
<!--- Setup our untrusted HTML content. --->
<cfsavecontent variable="unsafeHtml">

	<p>
		Check out
		<a href="https://www.bennadel.com" onmousedown="alert( 'XSS!' )">my site</a>.
	</p>

	<marquee loop="-1" width="100%">
		I am very trustable! You can totes trust me!
	</marquee>

	<p>
		<strong>Thanks for stopping by!</strong> <em>You Rock!</em>
		<blink>Woot!</blink>
	</p>

</cfsavecontent>

<!--- ------------------------------------------------------------------------------ --->
<!--- ------------------------------------------------------------------------------ --->

<cfscript>

	// Create our AntiSamy instance.
	// --
	// NOTE: We would probably cache this in the Application scope. Or, more likely,
	// inside a proxy Component that handles all of the intricate interaction details
	// for us so that we don't have to know about all the junk below.
	antisamy = application.antisamyJavaLoader.create( "org.owasp.validator.html.AntiSamy" ).init();

	// Create our security policy from the given XML file. This policy determines what
	// tags and attributes will be allowed in the sanitized HTML; and, how AntiSamy
	// will treat invalid tags that it comes across.
	// --
	// NOTE: We would probably cache this so that we don't have to re-read it every time.
	// --
	// Read More: https://www.wavemaker.com/learn/app-development/app-security/xss-antisamy-policy-configuration/
	policy = application.antisamyJavaLoader.switchThreadContextClassLoader(
		getInstance__inProperContext,
		{
			PolicyClass: application.antisamyJavaLoader.create( "org.owasp.validator.html.Policy" ),
			policyFilePath: expandPath( "./demo-policy.xml" )
		}
	);

	// Scan the untrusted HTML. The results will contain both error messages and the
	// sanitized HTML output.
	result = antisamy.scan( javaCast( "string", unsafeHtml ), policy );

	writeOutput( encodeForHtml( result.getCleanHTML() ) );
	writeOutput( "<hr />" );
	writeDump( result.getErrorMessages() );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I am intended to be INVOKED BY THE JAVALOADER. I run the getInstance() method in a
	* context that forces the classes to be loaded from the AntiSamy JavaLoader. This
	* gets around issues in which Java classes try to load dependencies from the wrong
	* Class Loader.
	*
	* NOTE: While in this method, you cannot access the core ColdFusion classes. As such,
	* this method should do AS LITTLE AS POSSIBLE such that it can return to the normal
	* execution context as fast as possible.
	*
	* @PolicyClass I am the Policy class provided by AntiSamy.
	* @policyFilePath I am the path to our security policy XML file.
	* @output false
	*/
	public any function getInstance__inProperContext(
		required any PolicyClass,
		required string policyFilePath
		) {

		return( PolicyClass.getInstance( javaCast( "string", policyFilePath ) ) );

	}

</cfscript>
As you can see, we use the JavaLoader to create an instance of the AntiSamy library. For the demo, I'm just instantiating AntiSamy each time the page is run (to make the development easier). However, in a real-world scenario, I'd probably load and cache this library inside of another ColdFusion component that proxies the AntiSamy API. I'd also read and cache the Policy file so that I don't have to keep performing disk I/O.
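The caching idea can be sketched in plain Java as a small load-once cache. This is only an illustration under assumptions: PolicyCache is a hypothetical name, and the loader function is a stand-in for whatever expensive call parses the policy file (in the ColdFusion demo, that call would still have to go through switchThreadContextClassLoader()).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical cache; the loader stands in for an expensive call such as parsing a policy file.
class PolicyCache<P> {

    private final Map<String, P> cache = new ConcurrentHashMap<>();
    private final Function<String, P> loader;

    PolicyCache(Function<String, P> loader) {
        this.loader = loader;
    }

    // Runs the loader only the first time a given path is requested.
    P get(String policyFilePath) {
        return cache.computeIfAbsent(policyFilePath, loader);
    }

    public static void main(String[] args) {
        int[] loads = { 0 };
        PolicyCache<String> policies = new PolicyCache<>(path -> {
            loads[0]++; // Count how often the "disk read" actually happens.
            return "parsed:" + path;
        });

        policies.get("demo-policy.xml");
        policies.get("demo-policy.xml");

        System.out.println(loads[0]); // 1
    }
}
```

Keying the cache by file path means a second policy file (say, a more relaxed one for post authoring) could live alongside the strict one without extra plumbing.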
That said, once we have our AntiSamy instance, we can use the .scan() method to evaluate the untrusted HTML. The .scan() method returns a result object that contains both the sanitized / clean HTML as well as a collection of error messages that outline how the untrusted HTML was manipulated. In this case, if we run the page, we get the following page output:
As you can see, the unknown tags, Marquee and Blink, were stripped out; but, their content was allowed to remain. That said, these two tags were listed in the errors collection. So, it's up to you as the developer to decide if you want to store the sanitized version. Or, if you want to kick it back to the user with the listed problems.
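That store-or-reject decision could be captured in a tiny helper. The sketch below is hypothetical and deliberately takes a plain list of messages rather than AntiSamy's result object; the class name, the enum values, and the sample messages are all invented for illustration and are not AntiSamy's real wording.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical decision helper; it works on plain values, not on AntiSamy's result object.
class SanitizationDecision {

    enum Outcome { ACCEPT_CLEAN, ACCEPT_SANITIZED, REJECT_WITH_ERRORS }

    // In strict mode, any reported problem sends the input back to the user;
    // otherwise the sanitized version is stored.
    static Outcome decide(List<String> errorMessages, boolean strict) {
        if (errorMessages.isEmpty()) {
            return Outcome.ACCEPT_CLEAN;
        }
        return strict ? Outcome.REJECT_WITH_ERRORS : Outcome.ACCEPT_SANITIZED;
    }

    public static void main(String[] args) {
        // Placeholder messages, not AntiSamy's actual error text.
        List<String> errors = Arrays.asList("marquee tag removed", "blink tag removed");

        System.out.println(decide(errors, false));                  // ACCEPT_SANITIZED
        System.out.println(decide(errors, true));                   // REJECT_WITH_ERRORS
        System.out.println(decide(Collections.emptyList(), true));  // ACCEPT_CLEAN
    }
}
```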
AntiSamy is a pretty cool project for content sanitization and the prevention of XSS (Cross-Site Scripting) attacks! It's not exactly clear how active the project is. But, considering that it's an OWASP project, I have to imagine that they will evolve it as necessary. That said, it's really just an HTML parser; so, it should naturally be able to adapt to changes in the HTML specification (as long as it continues to consist of Tags and Attributes). And, hopefully this post helps you understand how AntiSamy 1.5.7 might be loaded into a ColdFusion 10 application.
Reader Comments
Nice! That was a real rabbit hole... incredibly frustrating.
I tried a number of other approaches (fat jars, modified pom.xml builds, etc), none of which worked, so I'm glad to see you were able to get the dependencies sorted out and working.
Along with watching the AntiSamy project to see if they sort out their use of Xerces, I'm also going to explore this other OWASP project: https://www.owasp.org/index.php/OWASP_Java_HTML_Sanitizer_Project
Looks like it's not quite as robust as AntiSamy yet, but is being very actively developed.
Cheers!
@Matthew,
Oh man! How are we supposed to keep up with all this stuff?! I'll take a look as well. Though, what I'd really like is just to create some sort of Abstraction around sanitizing / validating HTML, so I could swap the Java libraries out under the hood.
Thanks again for all your help!
Informative, thanks for the detailed explanation!
@Mahendra,
Glad you enjoyed it. Using AntiSamy certainly adds a layer of confidence to accepting open-ended data from users.
Hi Ben,
it comes as a surprise you're still on CF 10. Later versions load JAR files much more easily through "this.javaSettings" in Application.cfc.
I recently cleaned HTML output using jsoup, which does not need to be configured through such a "heavy" XML file.
I believed their release to be much more recent than AntiSamy's, but they obviously released versions in 2017, too, after some years without updates.
AntiSamy returns a list of findings, which is nice.
http://central.maven.org/maven2/org/owasp/antisamy/antisamy/
http://central.maven.org/maven2/org/jsoup/jsoup/
Best,
Bernhard
I studied your blog further and understood you found the maven downloads yourself.
When I try a later version of Adobe ColdFusion and use "this.javaSettings" I run into the same error
Error casting an object of type org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory to an incompatible type. This usually indicates a programming error in Java, although it could also mean you have tried to use a foreign object in a different way than it was designed.
org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
@Bernhard,
Yeah, I'm working on updating my version of ColdFusion. I may actually move over to Lucee as I'm actually paying $$$ for a hosted version of ColdFusion. We'll see -- it depends on how much I actually want to learn about managing a server (vs. being able to just open a Support ticket).
As far as using the ColdFusion settings for loading Java classes, I think that really only works when there are no conflicts with the dependencies. So, even in a later version of ColdFusion, I'd still use the JavaLoader project to load stuff like this. That way, you get total control over which classes are being used when. In fact, even with the JavaLoader project, you can see that I ran into problems and had to use the heavy-handed .switchThreadContextClassLoader() method to force the proper context when loading the XML classes (which would, otherwise, throw the casting error you were seeing).
Hi Ben,
Great post!
By any chance did you try to use the dynamic attributes with 1.5.7? I'm currently trying to implement this and got everything working except for the dynamic attributes, it keeps filtering out all of them, no matter what's in my policy file.
Thanks,
Landon
@Landon,
I have not tried any of the dynamic attributes yet. I assume you're talking about the data-* style attributes? If so, I'll put it on my list of things to try. For now, I'm just locking stuff down really hard 'cause it's all user-provided.
That said, I would like to eventually use Markdown as my post-authoring ingress. At that point, I'll need to have a lot more flexibility.
I think we can achieve the same result using the AntiSamy 1.5.5 library that comes with ColdFusion 2016. The main functions are isValidHTML() and getSafeHTML().
@Vikram,
Oh, very interesting, I did not know that newer versions of ColdFusion came with some of this functionality out-of-the-box. I'm still stuck on an older version of CF, so I am usually slow to see the new features. Hopefully I can upgrade sooner than later.
Hi Ben,
Cleaning user input is usually easy with the built-in functions of CFML, but not when it comes to textareas and rich text editors. I'll try your approach, 'cause my homegrown version doesn't cut it facing all the thousands of ways harmful code can be encoded :-(
And @Vikram, too bad these 2 functions haven't made it to Lucee (yet)!
@Sebastiaan,
100%. Everything is easy-peasy when all you want to do is escape all of the user input - then all the encodeForHtml() methods, and related, work like a charm. The real issue comes about when you want to allow some markup, but escape most markup. So, for example, letting the user enter simple markdown (bold, italic, etc), but not letting them do anything else. That's when this AntiSamy stuff really comes into play.
At work, I'd love to implement this a bit more; but, the problem is we have a lot of historic data that was not entered in this way! As such, I can't really "trust" any of it, since some of it would have been stored prior to the sanitization techniques. I'd have to migrate all the old data, running it through sanitization. And, ain't nobody got time for that right now :D
On a related note, I just tried using JSoup's sanitization workflow for the first time. Instead of using XML configuration files, it uses a jQuery-like fluent API to define the allow-list of elements, attributes and protocols:
www.bennadel.com/blog/4722-using-jsoup-to-sanitize-untrusted-html-in-coldfusion.htm
Definitely something worth peeking at.