Using jSoup To Sanitize Untrusted HTML In ColdFusion
For years, I've been using the OWASP AntiSamy project to sanitize untrusted HTML in ColdFusion. And for years, James Moberg has suggested that I just use JSoup. I'm not one to switch tools unnecessarily. However, when I went to install AntiSamy in a new project and remembered just how many JAR files were required, I figured it was time to look at JSoup's single JAR approach to cleaning and sanitizing HTML.
JSoup is an amazing Java library that brings the magic of jQuery's fluent API and effortless document object model (DOM) manipulation into the server-side world of ColdFusion. I use it quite heavily on this blog to post-process and normalize my content, extract OpenGraph tag data, insert <h2> anchor links, and parse GitHub Gist data (just to name a few things). In short, it's been tremendously helpful.
But, I've only ever used it to operate on trusted HTML content: either content that I've written personally or content that's first been run through AntiSamy's sanitization process. JSoup exposes functionality similar to AntiSamy through the org.jsoup.safety package and its Safelist and Cleaner classes.
I love the fact that JSoup doesn't use XML! While AntiSamy's XML-based configuration files work perfectly well, they feel a bit anachronistic in 2024. JSoup's answer to this is to provide a "builder pattern" that exposes a fluent API for defining an allowlist of tags, attributes, and protocols that can appear in the source document.
For example, if you only want to allow for simple text formatting, you could add just the Bold and Italic tags to the allowlist:
<cfscript>
	safelist = create( "org.jsoup.safety.Safelist" )
		.init()
		.addTags([ "strong", "em" ])
	;
</cfscript>
Of course, this would only work for a single line of input text since paragraph tags will be stripped out. If you want to allow for multiple lines of content, you have to add things like Paragraph and List tags to the allowlist:
<cfscript>
	safelist = create( "org.jsoup.safety.Safelist" )
		.init()
		.addTags([ "strong", "em" ])
		.addTags([ "p", "ul", "ol", "li" ])
	;
</cfscript>
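To see the difference between these two allowlists, here is a minimal sketch that runs the same input through both. The sample string is made up for illustration, and the create() helper is the same JAR-loading helper used in the full demos later in this post (it assumes Lucee's ability to load a Java class from a JAR path):
<cfscript>
	// Hypothetical sample input (not from the original post).
	sampleHtml = "<p>Hello <strong>world</strong></p><p>Second <em>line</em></p>";
	// Allowlist with inline formatting only.
	inlineOnly = create( "org.jsoup.safety.Safelist" )
		.init()
		.addTags([ "strong", "em" ])
	;
	// Allowlist with inline formatting plus paragraph and list tags.
	withBlocks = create( "org.jsoup.safety.Safelist" )
		.init()
		.addTags([ "strong", "em" ])
		.addTags([ "p", "ul", "ol", "li" ])
	;
	jsoup = create( "org.jsoup.Jsoup" );
	// Run the same input through both allowlists and echo the escaped results.
	echo( "<pre>" & encodeForHtml( jsoup.clean( sampleHtml, inlineOnly ) ) & "</pre>" );
	echo( "<pre>" & encodeForHtml( jsoup.clean( sampleHtml, withBlocks ) ) & "</pre>" );
	/**
	* I create an instance of the given JSoup package class.
	*/
	private any function create( required string className ) {
		var jarPaths = [ "./jsoup-1.18.1.jar" ];
		return createObject( "java", className, jarPaths );
	}
</cfscript>
The first result should keep only the text and the inline formatting, since the paragraph tags aren't on that allowlist; the second should keep the paragraphs as well.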
You can also specify optional tag attributes, required tag attributes, and viable attribute protocols. To demonstrate this in more depth, let's define a file with an assortment of untrusted HTML content:
<p>
Hey, check out my <a href="javascript:void(0)">awesome site</a>!
</p>
<p>
It's <strong onclick="alert(1)" class="highlite">so <a>great</a>!</strong>
</p>
<p>
<malicious>This is fun, too</malicious>; you should <em>try it</em>.
</p>
<p>
Cheers, <a href="https://www.bennadel.com" target="_blank">Ben Nadel</a>
</p>
<script>
alert(1);
</script>
<pre class="language-js"><code class="so-cool">var x = prompt( "Get to the choppa!" );</code></pre>
Note that this untrusted HTML contains a javascript: protocol, an onclick attribute, a <malicious> tag, and some unexpected class attribute values. To remove these unwanted HTML constructs, we're going to:
- Define an instance of Safelist.
- Add only the elements, attributes, and protocols that we want.
- Run the untrusted HTML through JSoup's static .clean() method.
Then, we'll output the sanitized version of the HTML in our ColdFusion response:
<cfscript>
	// The Safelist class is used to define which elements, attributes, and protocols are
	// allowed (and which are mandated). The Safelist class also provides some default
	// configurations that are provided as static methods (ex, .simpleText(), .basic(),
	// .basicWithImages()).
	safelist = create( "org.jsoup.safety.Safelist" )
		.init()
		// Add basic text formatting elements. This allows these tag-names to be
		// included; but, doesn't inherently allow for any tag attributes. At this point,
		// any attributes in these tags are quietly stripped during the sanitization.
		.addTags([ "strong", "b", "em", "i", "u" ])
		// Add basic text layout elements. This allows these tag-names to be
		// included; but, doesn't inherently allow for any tag attributes.
		.addTags([ "p", "blockquote", "ul", "ol", "li", "br" ])
		// Add code block related elements.
		.addTags([ "pre", "code" ])
		// Code block elements usually have a language specification via the class.
		.addAttributes( "pre", [ "class" ] )
		.addAttributes( "code", [ "class" ] )
		// Add anchor links.
		.addTags([ "a" ])
		.addAttributes( "a", [ "href" ] )
		// We don't want to allow "javascript:" or "mailto:" protocols - allowlist only
		// the standard http protocols.
		.addProtocols( "a", "href", [ "http", "https" ] )
		// Force every anchor link to have a security-oriented rel directive. Even if the
		// input has an existing [rel] attribute, it will be overwritten.
		// --
		// Note: I'm omitting "nofollow" in the spirit of a connected web.
		.addEnforcedAttribute( "a", "rel", "noopener noreferrer" )
	;
	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //
	untrustedHtml = fileRead( "./input.html", "utf-8" );
	// Sanitize the untrusted HTML using our defined Safelist. This returns a string. If
	// we need to return a DOM instance, we have to use the Cleaner class directly rather
	// than the JSoup static method (we'll do this in the next demo).
	trustedHtml = create( "org.jsoup.Jsoup" )
		.clean( untrustedHtml, safelist )
	;
	echo( "<pre>" & encodeForHtml( trustedHtml ) & "</pre>" );
	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //
	/**
	* I create an instance of the given JSoup package class.
	*/
	private any function create( required string className ) {
		var jarPaths = [ "./jsoup-1.18.1.jar" ];
		return createObject( "java", className, jarPaths );
	}
</cfscript>
If we run this ColdFusion code, parse the HTML, sanitize it, and then echo the resultant HTML string, we get the following output (I've added some whitespace for readability):
<p>
Hey, check out my <a rel="noopener noreferrer">awesome site</a>!
</p>
<p>
It's <strong>so <a rel="noopener noreferrer">great</a>!</strong>
</p>
<p>
This is fun, too; you should <em>try it</em>.
</p>
<p>
Cheers, <a href="https://www.bennadel.com" rel="noopener noreferrer">Ben Nadel</a>
</p>
<pre class="language-js">
<code class="so-cool">var x = prompt( "Get to the choppa!" );</code>
</pre>
This is a great first step! None of the malicious / block-listed content made it through the sanitization process. There's no javascript: protocol, there's no onclick attribute, and there's no <malicious> element. But, the output is a little strange. Now, we have <a> tags that don't really do anything. And, we have a <code> element with class="so-cool", which isn't an expected language specification.
When we use JSoup's .clean() method, as we did above, it uses the Cleaner class under the hood: it parses the input, sanitizes it, and serializes the result back to HTML. To exercise more granular control over the sanitization process, we can use the Cleaner class directly. This will leave us with a sanitized Document Object Model (DOM) instead of a string. We can then use the DOM API to perform some additional clean-up tasks.
This time, after the Cleaner class applies the Safelist constraints, we're going to remove / unwrap any <a> tags with no href, and we're going to remove the class attribute from any <pre> or <code> tags if the class attribute value doesn't match a language specification.
Note that this is the same Safelist specification as we used above; only I've removed all of the commenting for the sake of brevity.
<cfscript>
	// Define our allow-listed markup.
	safelist = create( "org.jsoup.safety.Safelist" )
		.init()
		.addTags([ "strong", "b", "em", "i", "u" ])
		.addTags([ "p", "blockquote", "ul", "ol", "li", "br" ])
		.addTags([ "pre", "code" ])
		.addAttributes( "pre", [ "class" ] )
		.addAttributes( "code", [ "class" ] )
		.addTags([ "a" ])
		.addAttributes( "a", [ "href" ] )
		.addProtocols( "a", "href", [ "http", "https" ] )
		.addEnforcedAttribute( "a", "rel", "noopener noreferrer" )
	;
	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //
	untrustedDom = create( "org.jsoup.Jsoup" )
		.parse( fileRead( "./input.html", "utf-8" ) )
	;
	// This time, instead of using the JSoup static method, we're going to create an
	// instance of the Cleaner class. This allows us to both pass-in and receive a
	// Document instead of a string, which will allow us to do some post-processing.
	trustedDom = create( "org.jsoup.safety.Cleaner" )
		.init( safelist )
		.clean( untrustedDom )
	;
	// At this point, our DOM contains a sanitized structure. However, this structure
	// includes some funky elements - like anchor tags with no [href] attribute. Let's
	// remove (unwrap) anchors that don't have an [href] attribute.
	for ( element in trustedDom.select( "a" ) ) {
		if ( element.attr( "href" ) == "" ) {
			element.unwrap();
		}
	}
	// We also might have pre and code elements with unsupported class attributes. Let's
	// remove class attributes that don't match the regular expression for a language
	// specification. This will leave the pre/code elements in place, but strip out the
	// [class] attribute.
	for ( element in trustedDom.select( "pre[class], code[class]" ) ) {
		if ( ! element.attr( "class" ).reFind( "^language-[a-zA-Z0-9]+$" ) ) {
			element.removeAttr( "class" );
		}
	}
	// Now that we have a sanitized DOM with some additional post-processing, we can
	// manually serialize it into a trusted HTML payload.
	trustedHtml = trustedDom
		.body()
		.html()
	;
	echo( "<pre>" & encodeForHtml( trustedHtml ) & "</pre>" );
	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //
	/**
	* I create an instance of the given JSoup package class.
	*/
	private any function create( required string className ) {
		var jarPaths = [ "./jsoup-1.18.1.jar" ];
		return createObject( "java", className, jarPaths );
	}
</cfscript>
With this approach, we have to explicitly parse and serialize the HTML. But, that gives us an opportunity to work directly with the DOM; which, in turn, gives us much lower-level control. And, when we run this Lucee CFML code, we get the following HTML output (I've added whitespace for readability):
<p>
Hey, check out my awesome site!
</p>
<p>
It's <strong>so great!</strong>
</p>
<p>
This is fun, too; you should <em>try it</em>.
</p>
<p>
Cheers, <a href="https://www.bennadel.com" rel="noopener noreferrer">Ben Nadel</a>
</p>
<pre class="language-js">
<code>var x = prompt( "Get to the choppa!" );</code>
</pre>
This time, not only did we get all the blocklisted content removed by the Safelist specification, we also manually removed "inert" anchor tags and funky class attributes.
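If the goal is to warn the author of the content rather than silently rewrite it, a minimal sketch (assuming the same allowlist, input file, and create() helper as the demos above) might check the input with JSoup's static .isValid() method before cleaning. The messages below are hypothetical placeholders:
<cfscript>
	// Same allowlist as in the demos above.
	safelist = create( "org.jsoup.safety.Safelist" )
		.init()
		.addTags([ "strong", "b", "em", "i", "u" ])
		.addTags([ "p", "blockquote", "ul", "ol", "li", "br" ])
		.addTags([ "pre", "code" ])
		.addAttributes( "pre", [ "class" ] )
		.addAttributes( "code", [ "class" ] )
		.addTags([ "a" ])
		.addAttributes( "a", [ "href" ] )
		.addProtocols( "a", "href", [ "http", "https" ] )
		.addEnforcedAttribute( "a", "rel", "noopener noreferrer" )
	;
	untrustedHtml = fileRead( "./input.html", "utf-8" );
	// The isValid() method treats the input as a body fragment and reports whether it
	// already conforms to the Safelist. It does not modify or return the markup.
	if ( create( "org.jsoup.Jsoup" ).isValid( untrustedHtml, safelist ) ) {
		echo( "The input already conforms to the allowlist." );
	} else {
		echo( "The input contains markup that would be removed during sanitization." );
	}
	/**
	* I create an instance of the given JSoup package class.
	*/
	private any function create( required string className ) {
		var jarPaths = [ "./jsoup-1.18.1.jar" ];
		return createObject( "java", className, jarPaths );
	}
</cfscript>
For the sample input used in this post, the check should take the second branch, since elements and attributes get stripped during cleaning. It only gives a yes/no answer; it doesn't say which elements or attributes are the problem.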
I really like this! I find the Safelist specification easier to understand than the more flexible but more opaque XML configuration files used by AntiSamy. And, I absolutely love that I can easily perform additional sanitization and clean-up using the JSoup API. And, it's only one JAR file! You just can't beat JSoup's simplicity and ease-of-consumption.
Reader Comments
Is that 3 quote syntax really correct?
.addTags([ "strong", em" ])
@Dan,
OMG! I can't tell you how long I stared at that code trying to figure why the syntax highlighting wasn't working! I just couldn't see it. I was 2 seconds away from deleting it and rewriting it, but then I had to get up from the desk. Thanks for seeing the issue and letting me know. It's been fixed 🙌
@All, here's a fast-follow that looks at how to use JSoup's Safelist API to report on the untrusted HTML elements and attributes: www.bennadel.com/blog/4723-using-jsoup-to-report-untrusted-html-elements-and-attributes-in-coldfusion.htm
This is something that I will need to do on this blog when I convert it over to using JSoup. I can't (or rather don't want to) remove user comment elements without letting them know. Users need a chance to go in and fix their issues (which are "markdown mistakes" in 99% of cases).