Using jSoup To Sanitize Untrusted HTML In ColdFusion

By Ben Nadel

Published 2024-11-07 in ColdFusion — Comments (3)

For years, I've been using the OWASP AntiSamy project to sanitize untrusted HTML in ColdFusion. And for years, James Moberg has suggested that I just use JSoup. I'm not one to switch tools unnecessarily. However, when I went to install AntiSamy in a new project and remembered just how many JAR files were required, I figured it was time to look at JSoup's single JAR approach to cleaning and sanitizing HTML.

JSoup is an amazing Java library that brings the magic of jQuery's fluent API and effortless document object model (DOM) manipulation into the server-side world of ColdFusion. I use it quite heavily on this blog to post-process and normalize my content, extract OpenGraph tag data, insert <h2> anchor links, and parse GitHub Gist data (just to name a few things). In short, it's been tremendously helpful.

But, I've only ever used it to operate on trusted HTML content—either content that I've written personally; or, content that's first been run through AntiSamy's sanitization process. JSoup exposes functionality similar to AntiSamy through the org.jsoup.safety package and its Safelist and Cleaner classes.

I love the fact that JSoup doesn't use XML! While AntiSamy's XML-based configuration files work perfectly well, they feel a bit anachronistic in 2024. JSoup's answer to this is to provide a "builder pattern" that exposes a fluent API for defining an allowlist of tags, attributes, and protocols that can appear in the source document.

For example, if you only want to allow for simple text formatting, you could add just the Bold and Italic tags to the allowlist:

  
    <cfscript>

        safelist = create( "org.jsoup.safety.Safelist" )
            .init()
            .addTags([ "strong", "em" ])
        ;

    </cfscript>

Of course, this would only work for a single line of input text since paragraph tags will be stripped out. If you want to allow for multiple lines of content, you have to add things like Paragraph and List tags to the allowlist:

  
    <cfscript>

        safelist = create( "org.jsoup.safety.Safelist" )
            .init()
            .addTags([ "strong", "em" ])
            .addTags([ "p", "ul", "ol", "li" ])
        ;

    </cfscript>
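As an alternative to starting from an empty allowlist, JSoup also ships preset Safelist configurations as static methods (such as .basic()). What follows is a minimal sketch, not taken from the original demos, of extending a preset rather than building from scratch. The sample input string and the JAR path are assumptions for illustration. As far as I know, the .basic() preset also permits the ftp and mailto protocols on links and enforces rel="nofollow", so review the preset before relying on it:

```cfml
<cfscript>

    // Static factory methods are invoked on the class itself, so there is
    // no .init() call here (unlike the empty-Safelist examples).
    safelist = create( "org.jsoup.safety.Safelist" )
        .basic()
        // Extend the preset: allow a class attribute on code block elements.
        .addAttributes( "pre", [ "class" ] )
        .addAttributes( "code", [ "class" ] )
    ;

    // Hypothetical untrusted input, used only for this sketch.
    untrustedHtml = "<p onclick='alert(1)'>Hello <b>world</b></p>";

    trustedHtml = create( "org.jsoup.Jsoup" )
        .clean( untrustedHtml, safelist )
    ;

    echo( "<pre>" & encodeForHtml( trustedHtml ) & "</pre>" );

    /**
    * I create an instance of the given JSoup package class. The JAR path is an
    * assumption about the local setup.
    */
    private any function create( required string className ) {

        return createObject( "java", className, [ "./jsoup-1.18.1.jar" ] );

    }

</cfscript>
```

Starting from a preset trades some explicitness for brevity; the empty-Safelist approach makes every allowed tag visible in your own code.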

You can also specify optional tag attributes, required tag attributes, and viable attribute protocols. To demonstrate this in more depth, let's define a file with an assortment of untrusted HTML content:

  
    <p>
        Hey, check out my <a href="javascript:void(0)">awesome site</a>!
    </p>
    <p>
        It's <strong onclick="alert(1)" class="highlite">so <a>great</a>!</strong>
    </p>
    <p>
        <malicious>This is fun, too</malicious>; you should <em>try it</em>.
    </p>
    <p>
        Cheers, <a href="https://www.bennadel.com" target="_blank">Ben Nadel</a>
    </p>
    <script>
        alert(1);
    </script>
    <pre class="language-js"><code class="so-cool">var x = prompt( "Get to the choppa!" );</code></pre>

Note that this untrusted HTML contains a javascript: protocol, an onclick attribute, a <malicious> tag, and some unexpected class attribute values. To remove these unwanted HTML constructs, we're going to:

1. Define an instance of Safelist.
2. Add only the elements, attributes, and protocols that we want.
3. Run the untrusted HTML through JSoup's static .clean() method.

Then, we'll output the sanitized version of the HTML in our ColdFusion response:

  
    <cfscript>

        // The Safelist class is used to define which elements, attributes, and protocols are
        // allowed (and which are mandated). The Safelist class also provides some default
        // configurations that are provided as static methods (ex, .simpleText(), .basic(),
        // .basicWithImages()).
        safelist = create( "org.jsoup.safety.Safelist" )
            .init()
            // Add basic text formatting elements. This allows these tag-names to be
            // included; but, doesn't inherently allow for any tag attributes. At this point,
            // any attributes in these tags are quietly stripped during the sanitization.
            .addTags([ "strong", "b", "em", "i", "u" ])

            // Add basic text layout elements. This allows these tag-names to be
            // included; but, doesn't inherently allow for any tag attributes.
            .addTags([ "p", "blockquote", "ul", "ol", "li", "br" ])

            // Add code block related elements.
            .addTags([ "pre", "code" ])
            // Code block elements usually have a language specification via the class.
            .addAttributes( "pre", [ "class" ] )
            .addAttributes( "code", [ "class" ] )

            // Add anchor links.
            .addTags([ "a" ])
            .addAttributes( "a", [ "href" ] )
            // We don't want to allow "javascript:" or "mailto:" protocols - allowlist only
            // the standard http protocols.
            .addProtocols( "a", "href", [ "http", "https" ] )
            // Force every anchor link to have security-oriented rel directives. Even if the
            // input has an existing [rel] attribute, it will be overwritten.
            // --
            // Note: I'm omitting "nofollow" in the spirit of a connected web.
            .addEnforcedAttribute( "a", "rel", "noopener noreferrer" )
        ;

        // ------------------------------------------------------------------------------- //
        // ------------------------------------------------------------------------------- //

        untrustedHtml = fileRead( "./input.html", "utf-8" );
        // Sanitize the untrusted HTML using our defined Safelist. This returns a string. If
        // we need to return a DOM instance, we have to use the Cleaner class directly rather
        // than the JSoup static method (we'll do this in the next demo).
        trustedHtml = create( "org.jsoup.Jsoup" )
            .clean( untrustedHtml, safelist )
        ;

        echo( "<pre>" & encodeForHtml( trustedHtml ) & "</pre>" );

        // ------------------------------------------------------------------------------- //
        // ------------------------------------------------------------------------------- //

        /**
        * I create an instance of the given JSoup package class.
        */
        private any function create( required string className ) {

            var jarPaths = [ "./jsoup-1.18.1.jar" ];

            return createObject( "java", className, jarPaths );

        }

    </cfscript>

If we run this ColdFusion code, which parses the HTML, sanitizes it, and then echoes the resultant HTML string, we get the following output (I've added some whitespace for readability):

  
    <p>
        Hey, check out my <a rel="noopener noreferrer">awesome site</a>!
    </p>
    <p>
        It's <strong>so <a rel="noopener noreferrer">great</a>!</strong>
    </p>
    <p>
        This is fun, too; you should <em>try it</em>.
    </p>
    <p>
        Cheers, <a href="https://www.bennadel.com" rel="noopener noreferrer">Ben Nadel</a>
    </p>
    <pre class="language-js">
        <code class="so-cool">var x = prompt( "Get to the choppa!" );</code>
    </pre>

This is a great first step! None of the malicious / block-listed content made it through the sanitization process. There's no javascript: protocol, there's no onclick attribute, and there's no <malicious> element. But, the output is a little strange. Now, we have <a> tags that don't really do anything. And, we have a <code> element with class="so-cool", which isn't an expected language specification.

When we use JSoup's .clean() method—as we did above—it uses the Cleaner class under the hood; and then both parses the input and serializes the response as HTML. To exercise a little more granular control over the sanitization process, we can use the Cleaner class directly. This will leave us with a sanitized Document Object Model (DOM) instead of a string. We can then use the DOM API to perform some additional clean-up tasks.

This time, after the Cleaner class applies the Safelist constraints, we're going to remove / unwrap any <a> tags with no href and we're going to remove the class attribute from any <pre> or <code> tags if it (the class attribute value) doesn't match a language specification.

Note that this is the same Safelist specification that we used above; I've only removed the comments for the sake of brevity.

  
    <cfscript>

        // Define our allow-listed markup.
        safelist = create( "org.jsoup.safety.Safelist" )
            .init()
            .addTags([ "strong", "b", "em", "i", "u" ])
            .addTags([ "p", "blockquote", "ul", "ol", "li", "br" ])
            .addTags([ "pre", "code" ])
            .addAttributes( "pre", [ "class" ] )
            .addAttributes( "code", [ "class" ] )
            .addTags([ "a" ])
            .addAttributes( "a", [ "href" ] )
            .addProtocols( "a", "href", [ "http", "https" ] )
            .addEnforcedAttribute( "a", "rel", "noopener noreferrer" )
        ;

        // ------------------------------------------------------------------------------- //
        // ------------------------------------------------------------------------------- //

        untrustedDom = create( "org.jsoup.Jsoup" )
            .parse( fileRead( "./input.html", "utf-8" ) )
        ;

        // This time, instead of using the JSoup static method, we're going to create an
        // instance of the cleaner class. This allows us to both pass-in and receive a
        // Document instead of a string which will allow us to do some post-processing.
        trustedDom = create( "org.jsoup.safety.Cleaner" )
            .init( safelist )
            .clean( untrustedDom )
        ;

        // At this point, our DOM contains a sanitized structure. However, this structure
        // includes some funky elements - like anchor tags with no [href] attribute. Let's
        // remove (unwrap) anchors that don't have an [href] attribute.
        for ( element in trustedDom.select( "a" ) ) {

            if ( element.attr( "href" ) == "" ) {

                element.unwrap();

            }

        }

        // We also might have pre and code elements with unsupported class attributes. Let's
        // remove class attributes that don't match the regular expression for a language
        // specification. This will leave the pre/code elements in place, but strip out the
        // [class] attribute.
        for ( element in trustedDom.select( "pre[class], code[class]" ) ) {

            if ( ! element.attr( "class" ).reFind( "^language-[a-zA-Z0-9]+$" ) ) {

                element.removeAttr( "class" );

            }

        }

        // Now that we have a sanitized DOM with some additional post-processing, we can
        // manually serialize it into a trusted HTML payload.
        trustedHtml = trustedDom
            .body()
            .html()
        ;

        echo( "<pre>" & encodeForHtml( trustedHtml ) & "</pre>" );

        // ------------------------------------------------------------------------------- //
        // ------------------------------------------------------------------------------- //

        /**
        * I create an instance of the given JSoup package class.
        */
        private any function create( required string className ) {

            var jarPaths = [ "./jsoup-1.18.1.jar" ];

            return createObject( "java", className, jarPaths );

        }

    </cfscript>

With this approach, we have to explicitly parse and serialize the HTML. But, that gives us an opportunity to work directly with the DOM; which, in turn, gives us much lower-level control. And, when we run this Lucee CFML code, we get the following HTML output (I've added whitespace for readability):

  
    <p>
        Hey, check out my awesome site!
    </p>
    <p>
        It's <strong>so great!</strong>
    </p>
    <p>
        This is fun, too; you should <em>try it</em>.
    </p>
    <p>
        Cheers, <a href="https://www.bennadel.com" rel="noopener noreferrer">Ben Nadel</a>
    </p>
    <pre class="language-js">
        <code>var x = prompt( "Get to the choppa!" );</code>
    </pre>

This time, not only did we get all the blocklisted content removed by the Safelist specification, we also manually removed "inert" anchor tags and funky class attributes.
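The anchor clean-up could probably be condensed by pushing the "no href" check into the selector itself. This is only a sketch under a few assumptions: the sample markup is hypothetical, the JAR path mirrors the demos above, and it relies on JSoup's :not() pseudo-selector together with the Elements-level .unwrap() method. Note that, unlike the loop in the demo, this selector would not match an anchor that carries an empty href="" attribute:

```cfml
<cfscript>

    // Hypothetical, already-sanitized markup used only for this sketch.
    dom = create( "org.jsoup.Jsoup" )
        .parseBodyFragment( '<p><a>inert</a> and <a href="https://www.bennadel.com">live</a></p>' )
    ;

    // Select every anchor WITHOUT an [href] attribute and unwrap them all in
    // one call, leaving their child nodes in place.
    dom.select( "a:not([href])" ).unwrap();

    echo( "<pre>" & encodeForHtml( dom.body().html() ) & "</pre>" );

    /**
    * I create an instance of the given JSoup package class. The JAR path is an
    * assumption about the local setup.
    */
    private any function create( required string className ) {

        return createObject( "java", className, [ "./jsoup-1.18.1.jar" ] );

    }

</cfscript>
```

The explicit loop from the demo remains the safer choice if you want to treat empty href values as inert, too.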

I really like this! I find it easier to understand the Safelist specification when compared to the more flexible but more opaque XML configuration files used by AntiSamy. And, I absolutely love that I can easily perform additional sanitization and clean-up using the JSoup API. And, it's only one JAR file! You just can't beat JSoup's simplicity and ease-of-consumption.


Short link: https://bennadel.com/4722

Reader Comments

Dan LeGate Nov 8, 2024 at 5:59 PM


Is that 3 quote syntax really correct?

.addTags([ "strong", em" ])

Ben Nadel Nov 8, 2024 at 6:23 PM


@Dan,

OMG! I can't tell you how long I stared at that code trying to figure why the syntax highlighting wasn't working! I just couldn't see it. I was 2 seconds away from deleting it and rewriting it, but then I had to get up from the desk. Thanks for seeing the issue and letting me know. It's been fixed 🙌

Ben Nadel Nov 9, 2024 at 12:24 PM


@All, here's a fast-follow that looks at how to use JSoup's Safelist API to report on the untrusted HTML elements and attributes:

www.bennadel.com/blog/4723-using-jsoup-to-report-untrusted-html-elements-and-attributes-in-coldfusion.htm

This is something that I will need to do on this blog when I convert it over to using JSoup. I can't (or rather don't want to) remove user comment elements without letting them know. Users need a chance to go in and fix their issues (which are "markdown mistakes" in 99% of cases).
