Ben Nadel at RIA Unleashed (Nov. 2010) with: Chris Bickford

Using The OWASP Java HTML Sanitizer In Lucee CFML 5.3.7.48 To Sanitize HTML Input And Prevent XSS Attacks


Earlier this week, at the Adobe ColdFusion Developer Conference, Charlie Arehart mentioned that the OWASP AntiSamy project was added to Adobe ColdFusion 11. I started using the AntiSamy project back in ColdFusion 10, and hadn't realized that it was now a native part of the ColdFusion runtime. This inspired me to go back and re-read my old post wherein I remembered that Matthew Clemente mentioned yet another OWASP project of relevance called the Java HTML Sanitizer. To keep things exciting, I decided to play around a bit with this Java HTML Sanitizer project in Lucee CFML 5.3.7.48.

View this code in my OWASP Java HTML Sanitizer With Lucee CFML 5.3.7.48 project on GitHub.

The OWASP Java HTML Sanitizer project works very much like the OWASP AntiSamy project insomuch as you define a policy that outlines what you want to allow in an untrusted input; and then, you process the input against that policy in order to produce safe, trusted output HTML.

What makes the OWASP Java HTML Sanitizer project nice is that, instead of using an XML file as you do with AntiSamy, your policy is defined in-code using a fluent API. There's nothing wrong with having to use an XML file - it just feels a bit outdated. And, defining your policy in-code means that you get to leverage all of the flexibility that your ColdFusion / Java runtime offers.
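As a rough sketch of what that flexibility could look like, here's a function that builds a looser or stricter policy based on a runtime flag. The function name `buildPolicy`, its `allowLinks` argument, and the `jarFiles` argument (an array of paths to the downloaded JAR files) are all hypothetical names for illustration; the builder methods are the same ones used in the demo further down.

```cfml
<cfscript>

	/**
	* HYPOTHETICAL helper: I build a sanitization policy, optionally allowing links.
	* The jarFiles argument is assumed to be an array of paths to the OWASP JAR files.
	*/
	public any function buildPolicy(
		required array jarFiles,
		boolean allowLinks = false
		) {

		var policyBuilder = createObject( "java", "org.owasp.html.HtmlPolicyBuilder", jarFiles )
			.init()
			.allowElements([ "p", "b", "strong", "i", "em" ])
		;

		// Because the policy is just code, we can branch on runtime conditions
		// (a user's role, a feature flag, etc.) while building it.
		if ( allowLinks ) {

			policyBuilder
				.allowElements([ "a" ])
				.allowUrlProtocols([ "http", "https" ])
				.requireRelNofollowOnLinks()
				.allowAttributes([ "href" ])
					.onElements([ "a" ])
			;

		}

		return( policyBuilder.toFactory() );

	}

</cfscript>
```

With an XML-based policy, a branch like this would likely mean maintaining two separate policy files.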

To get this working, I went to the Maven Repository and manually downloaded all of the JAR files necessary for version 20200713.1. I'm sure there's a really easy command-line way to do this; but, I never learned it. Then, once I had all the JAR files stored locally, I used Lucee CFML's ability to create Java classes using JAR paths.

Within the OWASP Java HTML Sanitizer, everything is blocked by default. You have to use the policy builder to allow-list specific elements and attributes within your untrusted input. In the following demo, I am allow-listing a few HTML elements and just a handful of attributes. Attributes can either be allow-listed globally; or, locked-down to a specific set of HTML elements.
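To see the blocked-by-default behavior in isolation, here's a minimal sketch that builds a policy with nothing allow-listed. The `jarFiles` variable and the vendor folder path are assumptions (mirroring the folder used in the demo); with an empty policy like this, I'd expect the markup to be dropped and only the text content to come back.

```cfml
<cfscript>

	// ASSUMPTION: the JAR files live in this folder; adjust the path for your setup.
	jarFiles = directoryList(
		expandPath( "./vendor/owasp-java-html-sanitizer-20200713.1" ),
		false,
		"path",
		"*.jar"
	);

	// No elements or attributes are allow-listed, so the policy allows no markup.
	emptyPolicy = createObject( "java", "org.owasp.html.HtmlPolicyBuilder", jarFiles )
		.init()
		.toFactory()
	;

	untrustedInput = '<p onclick="alert( 1 )">Hello <b>world</b></p>';

	// Encode for display so that we can see exactly what the sanitizer returned.
	writeOutput( encodeForHtml( emptyPolicy.sanitize( untrustedInput ) ) );

</cfscript>
```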

Here's a simple input / output demo - note that the <a> tag in my first paragraph contains a persisted XSS (Cross-Site Scripting) attack:

<cfscript>

	// This is the untrusted HTML input that we need to sanitize.
	```
	<cfsavecontent variable="htmlInput">

		<p>
			Check out
			<a href="https://www.bennadel.com" target="_blank" onmousedown="alert( 'XSS!' )">my site</a>.
		</p>

		<marquee loop="-1" width="100%">
			I am very trustable! You can totes trust me!
		</marquee>

		<p>
			<strong>Thanks for stopping by!</strong> <em>You Rock!</em> &amp;
			<blink>Woot!</blink>
		</p>

	</cfsavecontent>
	```

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	Pattern = createObject( "java", "java.util.regex.Pattern" );

	// The Policy Builder has a number of fluent APIs that allow us to incrementally
	// define the sanitization policy. It primarily consists of allow-listing elements
	// and attributes (usually in the context of a given set of elements).
	policyBuilder = javaNew( "org.owasp.html.HtmlPolicyBuilder" )
		.init()
		.allowElements([
			"p", "div",
			"br",
			"a",
			"b", "strong",
			"i", "em",
			"ul", "ol", "li"
		])
		.allowUrlProtocols([ "http", "https" ])
		.requireRelNofollowOnLinks()
		.allowAttributes([ "title" ])
			.globally()
		.allowAttributes([ "href", "target" ])
			.onElements([ "a" ])
		.allowAttributes([ "lang" ])
			.matching( Pattern.compile( "[a-zA-Z]{2,20}" ) )
			.globally()
		.allowAttributes([ "align" ])
			// NOTE: true = ignoreCase.
			.matching( true, [ "center", "left", "right", "justify" ] )
			.onElements([ "p" ])
	;
	policy = policyBuilder.toFactory();

	// Sanitize the HTML input.
	// --
	// NOTE: There's a more complicated invocation of the sanitization that allows you to
	// capture the block-listed elements and attributes that are removed from input. That
	// said, I could NOT FIGURE OUT how to do that - it looks like you might need to
	// write some actual Java code to provide the necessary arguments.
	sanitizedHtmlInput = policy.sanitize( htmlInput );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	```
	<h1>
		OWASP Java Html Sanitizer
	</h1>

	<h2>
		Untrusted Input
	</h2>

	<cfoutput>
		<!--- NOTE: I'm dedenting the indentation incurred by the CFSaveContent tag. --->
		<pre>#encodeForHtml( htmlInput.reReplace( "(?m)^\t\t", "", "all" ).trim() )#</pre>
	</cfoutput>

	<h2>
		Sanitized Input
	</h2>

	<cfoutput>
		<!--- NOTE: I'm dedenting the indentation incurred by the CFSaveContent tag. --->
		<pre>#encodeForHtml( sanitizedHtmlInput.reReplace( "(?m)^\t\t", "", "all" ).trim() )#</pre>
	</cfoutput>
	```

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I load the given Java class using the underlying JAR files.
	*/
	public any function javaNew( required string className ) {

		// I downloaded these from the Maven Repository (manually since I don't actually
		// know how Maven works).
		// --
		// https://mvnrepository.com/artifact/com.googlecode.owasp-java-html-sanitizer/owasp-java-html-sanitizer/20200713.1
		var jarFiles = [
			"./vendor/owasp-java-html-sanitizer-20200713.1/animal-sniffer-annotations-1.17.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/checker-qual-2.5.2.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/error_prone_annotations-2.2.0.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/failureaccess-1.0.1.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/guava-27.1-jre.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/j2objc-annotations-1.1.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/jsr305-3.0.2.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar",
			"./vendor/owasp-java-html-sanitizer-20200713.1/owasp-java-html-sanitizer-20200713.1.jar"
		];

		return( createObject( "java", className, jarFiles ) );

	}

</cfscript>

Once you have your Policy (generated from the Policy Builder), all you have to do is call .sanitize(input) and you get a safe HTML result. In this version of the code, you don't get a report of the elements / attributes that were removed from the input. There's a more complicated version of the sanitization process that uses some sort of an event-emitter to track the filtering process. Unfortunately, I couldn't get that to work as it required more Java know-how than I have.
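Since the policy only needs to be built once, one option is to wrap the .sanitize() call in a small helper that caches the policy. This is just a sketch: the function name `sanitizeUntrustedHtml` and the `htmlSanitizerPolicy` application-scope key are hypothetical, it leans on the `javaNew()` function from the demo above, and it assumes that the generated policy object is safe to share across requests (which is my understanding of the library, but worth verifying for your version).

```cfml
<cfscript>

	/**
	* HYPOTHETICAL helper: I sanitize the given untrusted HTML using a policy that is
	* built on first use and then cached in the application scope.
	*/
	public string function sanitizeUntrustedHtml( required string htmlInput ) {

		// ASSUMPTION: the policy is safe to reuse across requests. If two requests
		// race here, the policy just gets built twice (same result either way).
		if ( ! application.keyExists( "htmlSanitizerPolicy" ) ) {

			application.htmlSanitizerPolicy = javaNew( "org.owasp.html.HtmlPolicyBuilder" )
				.init()
				.allowElements([ "p", "strong", "em" ])
				.toFactory()
			;

		}

		return( application.htmlSanitizerPolicy.sanitize( htmlInput ) );

	}

</cfscript>
```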

That said, when we run the above ColdFusion code, we get the following output:

As you can see, elements and attributes that were not explicitly allow-listed have been removed. And, some link-spam and opener attack protection was also added.

ASIDE: The noopener noreferrer rel attribute values are meant to protect against an attack pattern known as Reverse Tabnabbing. Isn't protecting a web application fun?!

The OWASP (Open Web Application Security Project) projects are pretty dang amazing! And since they work primarily with Java, it means that pulling them into a ColdFusion or Lucee CFML application is usually a low-effort, high-return endeavor.


Reader Comments

15,902 Comments

@Gary,

That's a tricky one. It looks like ElementPolicy needs to implement an apply() method. And, I think the code that you linked to is implementing that as an inline Function expression (I'm not really a Java developer, so that syntax isn't entirely familiar to me). This is one area of the ColdFusion / Java integration that I've never been very familiar with. I think this is one of the places where you would need to implement a "Dynamic Proxy"; but, this is not something I've ever been able to do successfully -- especially when part of the Java code is being loaded dynamically (I always get class-loader mismatch issues).

Sorry! I wish I could be more help.

15,902 Comments

@Lionel,

I haven't personally used jSoup's cleaning functions. But, I have it bookmarked to look into. Last June, James Moberg was talking about it on Twitter:

I prefer using #jsoup because it allows rewriting, injecting & writing complex functions to analyze and perform other functions (ie, converting HTML5/CSS3 to email-friendly HTML) rather than just stripping which it can do too. It's also a single JAR....

It has a multiple preconfigured cleaner safelist settings you can use to sanitize HTML and return clean code. Even intentionally bad/invalid HTML gets fixed & cleaned.

I'll ping him on Twitter to see if he can add any additional color.

5 Comments

I did some experimenting with jsoup and found that even with the relaxed Safelist, it's removing things I don't want it to such as style and target attributes. I assume the default behavior can be modified, but that's also true of OWASP.

For years I've been using jsoup for parsing links from email newsletters and adding tracking urls, but for sanitizing HTML, I don't need all of its potential bells and whistles for other purposes, at least not yet.

I'm also confident using OWASP for web app security because that is their primary expertise. Their HTML Sanitizer is maintained by Google AppSec engineer Mike Samuel and undergoes constant adversarial security review.

As for the point about jsoup being a single jar file, for Lucee users it doesn't get any easier than the new SanitizeHTML() function that was introduced just two months ago. And since it uses OWASP Java HTML Sanitizer, all of the built-in policies are applied by default, or it can be customized via PolicyFactory, all without any additional jar files.

Here's a working example I saved at TryCF.

15,902 Comments

@Lionel,

Also, for what it's worth, I had mentioned James Moberg using jSoup and here's what he responded with on Twitter:

In the past, I had tried to use OWASP in a CF project, but an older version was used by Adobe & I didn't want to wait to see if they'd update it (since they haven't updated other libs like iText.) I started using jsoup because I had full control over the version that was used.

... so it seems that he chose jSoup more out of convenience than a heavy preference for the library.

That's awesome that Lucee has the Java sanitizer built-in now. I wonder what version that was released in. That's one thing that I miss from the Adobe docs - they were really good about including the versions in each page's "History" annotation.

One really nice thing about this is that you can build the "policy" in the code, as opposed to having a separate XML configuration file. There's something nice about having everything "in the code". Just easier to read, I think.

5 Comments

Looking at the GitHub repo for Lucee's ESAPI extension, I see that the commit was actually farther back on Feb 23, but apparently it wasn't until April when Jake01 in the dev forum asked about sanitizing HTML that it was then announced and bundled in Lucee 5.3.9 RC3.

The responsiveness of the Lucee community is pretty awesome. 🥰

The documentation is sometimes lacking, though, for example how to define custom policies, which is surprising because the default policies leave much to be desired. I had to figure it out first of all with your help, Ben. 😃 I also got a clue from emjay2 re: chaining .toFactory() so the intermediate policyBuilder var isn't needed. And then I dug deep into the Java source for HtmlPolicyBuilder and Sanitizers to see exactly what the default policies and specific methods are doing.

I'm developing a SafeHTML() function that uses SanitizeHTML() with my custom policy that allows all "safe" HTML, and also StripHTML() for validating user input when it's supposed to be plain text, again using SanitizeHTML(), but with an empty policy so that no HTML is allowed. These functions will help in automating the defense against XSS attacks. I intend my StripHTML() to be a drop-in replacement for the cflib.org UDF that was originally published by Ray Camden way back in 2014, because regex doesn't cut it anymore in this modern era of web app security.

I'm planning to share these full circle back into the Lucee community. ❤️

15,902 Comments

@Lionel,

+1 on the Lucee community being super responsive on the Dev-forums. That's been my experience as well. Also +1 on the documentation sometimes being a little lacking (though I feel bad because I know it's open-source and I could technically help 😨).

Sounds like fun stuff that you're building! I'll definitely take a look if you publish it anywhere. I've been super keen on HTML-related parsing and rendering lately. There's a lot of power to unlock with some of this stuff.

Ben Nadel