Using jSoup To Clean-Up And Normalize HTML In ColdFusion 2021

By Ben Nadel

Published 2022-02-08 in ColdFusion — Comments (6)

I would love to say that all of the content stored in my blogging database is in a pristine, production-ready state. But, it's not. A lot of it has old, historical choices that need to be cleaned-up. And, some formatting choices simply can't be persisted safely (such as CDN - Content-Delivery Network - domains). As such, I will always have to do some degree of pre-render processing on my persisted HTML content before I show it to the user. And, as of yesterday, I started performing that clean-up and sanitization using jSoup in ColdFusion 2021.

I first looked at using jSoup with ColdFusion back in 2012. jSoup parses HTML into a Document Object Model (DOM) that provides traversal and mutation methods inspired by jQuery. Which means, the API is simple, familiar, and very easy to use!

Unfortunately, I didn't give jSoup much more thought after that, even though James Moberg kept suggesting it every time I did anything with HTML (such as retrofitting 15-years of blog content onto Markdown using Lucee CFML's parseHtml() functionality or sanitizing user-generated comments with OWASP AntiSamy). But, now that I'm using it again, I'm already starting to think of other ways it can add value.

Pulling jSoup into ColdFusion is quite easy - it's just a single JAR file. It can be added to your lib folder, defined in your Application's this.javaSettings, consumed in createObject() (when using Lucee CFML), or loaded using the JavaLoader project. I happen to be using JavaLoader:

component {

	/**
	* I get called once when the application is being bootstrapped.
	*/
	public void function onApplicationStart() {

		application.jSoupJavaLoader = javaLoaderFactory.getJavaLoader([
			expandPath( "/jars/jsoup-1.14.3/jsoup-1.14.3.jar" )
		]);

	}

}
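
For anyone not already using JavaLoader, here's a rough sketch of the this.javaSettings approach mentioned above. The application name and the application.jSoup key are placeholders that I made up for illustration; the JAR folder is the same one used in the JavaLoader example:

```cfml
// Application.cfc - a sketch of loading jSoup without JavaLoader.
component {

	// Hypothetical application name (an assumption for this example).
	this.name = "JSoupDemo";

	// Tell ColdFusion where to look for additional JAR files for this application.
	this.javaSettings = {
		loadPaths: [ expandPath( "/jars/jsoup-1.14.3/" ) ],
		loadColdFusionClassPath: true,
		reloadOnChange: false
	};

	/**
	* I get called once when the application is being bootstrapped.
	*/
	public void function onApplicationStart() {

		// With the JAR on the application's class path, createObject() can resolve
		// the jSoup class directly. The key "jSoup" is a made-up name.
		application.jSoup = createObject( "java", "org.jsoup.Jsoup" );

	}

}
```

With that in place, application.jSoup.parse( html_content ) should behave the same as the JavaLoader-based call. The rest of this post sticks with JavaLoader.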

With this JavaLoader, parsing HTML into a jSoup Document Object Model (DOM) is as simple as calling:

jSoupJavaLoader.create( "org.jsoup.Jsoup" ).parse( html_content )

This parses the given HTML fragment into a Document node that is basically like a normal HTML page document, complete with html, head, and body elements. However, since I'm parsing content that doesn't represent a full HTML document, I then dip-down into the body element before I start performing my clean-up:

document.body()

And, once my clean-up workflow is complete, I'll get the innerHTML of the body so that I don't pull-in the irrelevant html and head tags into my rendered content:

document.body().html()
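
To make that round-trip a bit more concrete, here's a minimal sketch. It assumes the jSoup JAR is already on the class path (so plain createObject() works), and the sample fragment is made up:

```cfml
<cfscript>
	// ASSUMPTION: the jSoup JAR is already on the class path, so createObject()
	// can find it without JavaLoader.
	jSoup = createObject( "java", "org.jsoup.Jsoup" );

	// A made-up content fragment, not a full HTML page.
	fragment = "<p>Hello <a href='/blog/'>world</a></p>";

	// parse() wraps the fragment in a full Document (html, head, and body).
	doc = jSoup.parse( fragment );

	// outerHtml() on the Document includes the generated wrapper elements...
	writeOutput( encodeForHtml( doc.outerHtml() ) & "<br />" );

	// ... whereas body().html() gives back only the markup inside the body.
	writeOutput( encodeForHtml( doc.body().html() ) );
</cfscript>
```

Note that jSoup serializes the DOM on the way out, so the result may differ cosmetically from the input (attribute quoting and whitespace, for example).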

The jSoup DOM is a tree of Node instances. The API of these objects is part jQuery, part native DOM methods, part custom methods. Much of the power for me comes from the jQuery-like methods, such as:

.select() - equivalent to jQuery's .find()
.attr()
.addClass()
.appendTo()
.after()
.html()
.text()
.empty()
.is()
.val()
.wrap()
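
Here's a small sketch of a few of those methods in action. The markup is made up, and the code assumes the jSoup JAR is on the class path:

```cfml
<cfscript>
	// ASSUMPTION: the jSoup JAR is already on the class path.
	body = createObject( "java", "org.jsoup.Jsoup" )
		.parse( "<p>Read <a href='/blog/'>the blog</a> or <a href='/about/'>about me</a>.</p>" )
		.body()
	;

	// .select() takes a CSS selector and returns a collection of matching elements.
	links = body.select( "a[href]" );

	for ( link in links ) {

		// .attr( name ) reads an attribute; .text() reads the element's text content.
		writeOutput( encodeForHtml( link.attr( "href" ) & " : " & link.text() ) & "<br />" );

		// .addClass() and .attr( name, value ) mutate the element in place.
		link.addClass( "internal" );
		link.attr( "title", ( "Go to " & link.text() ) );

	}

	// .html() serializes the (now mutated) inner markup of the body.
	writeOutput( encodeForHtml( body.html() ) );
</cfscript>
```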

To pre-process my user-generated content, I'm basically repeating the same workflow several times with different logic:

Select matching elements.
Iterate over matching elements.
Update attributes of each element.
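
Stripped down to its essentials, that workflow looks something like the following sketch. The rel="noopener noreferrer" rule is purely illustrative (it's not one of my actual clean-up steps), and the code assumes the jSoup JAR is on the class path:

```cfml
<cfscript>
	// ASSUMPTION: the jSoup JAR is already on the class path.
	dom = createObject( "java", "org.jsoup.Jsoup" )
		.parse( "<p><a href='https://example.com/'>Out</a> and <a href='/blog/'>in</a></p>" )
		.body()
	;

	// 1. Select matching elements (here, links whose href starts with "http").
	externalLinks = dom.select( "a[href^='http']" );

	// 2. Iterate over matching elements.
	for ( node in externalLinks ) {

		// 3. Update attributes of each element.
		node.attr( "rel", "noopener noreferrer" );

	}

	writeOutput( encodeForHtml( dom.html() ) );
</cfscript>
```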

I can already think of more stuff that I want to do in my pre-processing (such as injecting header links); but, for now, here's what I've got:

Clean-up site-local references. When I include links and images in my content, I'm usually copy-pasting fully-qualified URLs. That is, URLs that include both the protocol and the domain. But, I don't need all that jazz for site-local URLs. Instead, all URLs can just start with / and be root-relative. Doing this will also make other clean-up activities a bit easier.
Inject CDN domain. While I don't want to serve static assets from the CDN (Content-Delivery Network) locally in development, I do want to do that in production. However, the CDN domain isn't persisted in the database. As such, I need to prepend it to the root-relative URLs generated in the first step.
Clean-up old style links. Over 15-years, my URL/routing scheme has changed. Which means that I have embedded URLs with an outdated format. I can find these, look-up the relevant blog-post, and then rewrite the href attribute on these links.
Clean-up slugs. Just as my URL/routing scheme has changed over the years, so has the normalization of my route slugs. In order to reduce the 301 redirects that my site has to generate, I can look up each href value and make sure that it is using the most current slug formatting.
Proxy InVision links. When I link over to something work related, I like to proxy the link through a local bio page that provides a little background on what InVision is and why I co-founded it.
Remove https:// in text. When rendering URL-text in the comments, for superficial reasons, I like to strip-out the protocol. I just think it makes the URLs look nicer.

Each one of these steps is broken-out into its own private method in my BlogPostNormalizer.cfc ColdFusion component. This ColdFusion component exposes a single public method, normalizeContent(), which calls these in a specific order so that each one may make assumptions about the current state of the content.

NOTE: I'm using the property tags to drive dependency-injection (DI) in ColdFusion; but, I'm not going to bother showing the wiring-up of the component since I don't think it's terribly relevant to the topic.

component
	accessors = true
	output = false
	hint = "I provide some helper methods to clean and normalize the pre-rendering of blog post content."
	{

	// Define properties for dependency-injection.
	property config;
	property jSoupJavaLoader;
	property partialGateway;
	property utilities;

	// ---
	// PUBLIC METHODS.
	// ---

	/**
	* I apply some pre-render normalization to the given blog post / comment content.
	*/
	public string function normalizeContent(
		required string content,
		boolean stripRenderedProtocol = false
		) {

		// The jSoup library allows us to parse, traverse, and mutate HTML on the
		// ColdFusion server using a familiar jQuery-inspired syntax.
		var dom = jSoupJavaLoader
			.create( "org.jsoup.Jsoup" )
			.parse( content )
			.body()
		;

		// When authors and commenters paste links into the content, they paste them as
		// fully-qualified URLs. However, there's no need to include the domain in these
		// site-local links. Let's strip-off any domain information. This will also make
		// subsequent parsing easier.
		// --
		// CAUTION: Subsequent clean-up and transformation steps will assume that they can
		// depend on root-relative paths. As such, ALWAYS RUN THIS STEP FIRST!
		cleanUpLocalReferences( dom );
		// I want to serve static assets from the CDN domain. However, the assets aren't
		// entered with the CDN domain in mind (especially since it's not the same domain
		// in every environment). Let's prepend all static asset URLs with the CDN domain.
		cleanUpCdnReferences( dom );
		// Way back in the early days of this blog, I was using a routing solution that I
		// called "DAX" (yet another home-grown thing). I want to replace those links with
		// my modern linking approach.
		cleanUpDaxReferences( dom );
		// The format of my blog-post SLUGS has changed over the years. Upon access, URLs
		// in an older format will be 301'd to the latest format. However, redirects hurt
		// performance (and therefore search engine ranking). As such, I want to clean-up
		// any old-style slugs with the latest formatting.
		cleanUpSlugs( dom );
		// Redirect InVision links to proxy through local InVision bio.
		cleanUpInVisionLinks( dom );
		// For strictly visual reasons, I want to remove "https://" protocol from the text
		// portion of the site-local links (at least in the comments).
		if ( stripRenderedProtocol ) {

			cleanUpVisualProtocol( dom );

		}

		return( dom.html() );

	}

	// ---
	// PRIVATE METHODS.
	// ---

	/**
	* I prepend any static asset link with the CDN (Content-Delivery Network) domain.
	* 
	* CAUTION: Assumes that all site-local links are root-relative in format.
	*/
	private void function cleanUpCdnReferences( required any dom ) {

		// All of my images are in year-based folders within the uploads folder.
		for ( var node in dom.select( "[src^='/resources/uploads']" ) ) {

			var src = ( config.cdnUrl & node.attr( "src" ) );
			node.attr( "src", src );

		}

	}


	/**
	* I replace old, DAX-style links with modern links to other blog posts.
	*/
	private void function cleanUpDaxReferences( required any dom ) {

		// All DAX-style links were in the form of "section:id.action". So, viewing a blog
		// with ID 4 would be "blog:4.view". Since the ID portion is dynamic, let's look
		// for all links with the static DAX prefix and then parse the ID out dynamically.
		for ( var node in dom.select( "[href*='dax=blog:']" ) ) {

			var parts = node.attr( "href" )
				.reFindNoCase( "dax=blog:(\d+)\.view", 1, true )
			;

			// False-positive on initial match, skip node.
			if ( ! parts.len[ 1 ] ) {

				continue;

			}

			// Pull-back the blog entry with the given ID from the DAX link.
			var postID = val( parts.match[ 2 ] );
			var post = partialGateway.getPostForUrlCleanup( postID );

			if ( ! post.recordCount ) {

				continue;

			}

			node.attr(
				"href",
				( "/blog/" & utilities.generateEntityFilename( post.id, post.name ) )
			);

		}

	}


	/**
	* I rewrite all InVision links to proxy through my local InVision bio page so that
	* people can see a back-story on InVision before they follow the URL.
	*/
	private void function cleanUpInVisionLinks( required any dom ) {

		for ( var node in dom.select( "[href*='invisionapp.com']" ) ) {

			// The old WWW links didn't have HTTPS - fix that when redirecting.
			var href = node.attr( "href" )
				.replaceNoCase( "http:", "https:" )
			;

			var proxiedHref = "/invision/co-founder.htm?redirect=#encodeForUrl( href )#";
			node.attr( "href", proxiedHref );

		}

	}


	/**
	* I remove the domain from site-local references (either links or images).
	*/
	private void function cleanUpLocalReferences( required any dom ) {

		// CAUTION: Since the content on this blog spans 15-years of stylistic and
		// technical evolution, we have to account for variations in subdomains and in
		// protocol. These will all be replaced with "/" root-relative paths.
		var domainPattern = "https?://(www\.|local\.)?bennadel\.com/";
		var domainReplacement = "/";

		// Make image references root-relative paths.
		for ( var node in dom.select( "[src*='bennadel.com']" ) ) {

			var src = node
				.attr( "src" )
				.reReplace( domainPattern, domainReplacement )
			;
			node.attr( "src", src );
			node.attr( "loading", "lazy" );

		}

		// Make anchor references root-relative paths.
		for ( var node in dom.select( "[href*='bennadel.com']" ) ) {

			var href = node
				.attr( "href" )
				.reReplace( domainPattern, domainReplacement )
			;
			node.attr( "href", href );

		}

	}


	/**
	* I clean up any older blog URL slugs with the latest slug format in order to reduce
	* the number of 301 redirects that the platform has to execute.
	* 
	* CAUTION: Assumes that old-school DAX URLs have already been cleaned-up.
	*/
	private void function cleanUpSlugs( required any dom ) {

		// All blog links are in the format of "/blog/{id}-{slug}.htm". As such, we can
		// locate blog links using the static prefix and then parse the ID out of the
		// link on a per-node basis.

		// CAUTION: Selector assumes that all site-local links have already been made
		// root-relative (ie, start with "/").
		for ( var node in dom.select( "[href^='/blog/']" ) ) {

			var parts = node.attr( "href" )
				.reFindNoCase( "/blog/(\d+)-.+?\.htm", 1, true )
			;

			// False-positive on initial match, skip node.
			if ( ! parts.len[ 1 ] ) {

				continue;

			}

			var postID = val( parts.match[ 2 ] );
			var post = partialGateway.getPostForUrlCleanup( postID );

			if ( ! post.recordCount ) {

				continue;

			}

			// Apply normalized, modern slug format to link.
			node.attr(
				"href",
				( "/blog/" & utilities.generateEntityFilename( post.id, post.name ) )
			);

			// If the link doesn't have a TITLE, add traditional "Read article" title.
			if ( ! node.hasAttr( "title" ) ) {

				node.attr( "title", ( "Read article: " & post.name ) );

			}

		}

	}


	/**
	* I look for site-local links and then strip-out any "https://" protocol from the text
	* rendering. This makes the embedded-URLs a little more visually pleasing.
	*/
	private void function cleanUpVisualProtocol( required any dom ) {

		// This RegEx pattern-replacement replaces the matching domain with the first
		// captured group which is the domain less the protocol.
		var domainPattern = "https?://((www\.)?bennadel\.com)";
		var domainReplacement = "\1";

		// CAUTION: Selector assumes that all site-local links have already been made
		// root-relative (ie, start with "/").
		for ( var node in dom.select( "a[href^=/]" ) ) {

			for ( var textNode in node.textNodes() ) {

				if ( textNode.text().reFindNoCase( domainPattern ) ) {

					var newText = textNode.text()
						.reReplaceNoCase( domainPattern, domainReplacement, "all" )
					;
					textNode.text( newText );

				}

			}

		}

	}

}
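
Both the DAX and the slug clean-up methods lean on reFindNoCase() with the returnSubExpressions flag turned on, which can be a bit cryptic if you haven't used it before. Here's a tiny stand-alone sketch of what comes back; the sample href is made up for illustration:

```cfml
<cfscript>
	// A made-up, old-style link (assumption: the real links varied in shape).
	href = "/index.cfm?dax=blog:4.view";

	// The third argument (true) asks for the captured groups to be returned as a
	// struct of parallel arrays rather than as a single position.
	parts = href.reFindNoCase( "dax=blog:(\d+)\.view", 1, true );

	// Index 1 always describes the whole match; a length of 0 means "no match".
	if ( parts.len[ 1 ] ) {

		// Index 2 describes the first captured group (here, the digits of the ID).
		postID = val( parts.match[ 2 ] );
		writeOutput( "Post ID: #postID#" );

	} else {

		writeOutput( "Not a DAX link." );

	}
</cfscript>
```

This is why the methods above check parts.len[ 1 ] before reading parts.match[ 2 ]: the jSoup selector only narrows things down to "probably a match", and the pattern does the final validation.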

Before I started using jSoup yesterday, all of this stuff was being done strictly with Regular Expressions and java.util.regex.Pattern-based iteration. It was really verbose and really complex. It's awesome to see (you'll just have to trust me here) how much more readable this work has become now that I'm using jSoup.

Some of these methods are doing a little bit more than advertised. For example, notice that I'm enforcing lazy-loading on all images at the same time that I'm cleaning up site-local URLs:

node.attr( "loading", "lazy" );

It's so great that jSoup makes stuff like this super easy! Can you imagine trying to inject loading="lazy" using RegEx pattern matching? Trust me, it's much more complicated and prone to error.
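
If I ever wanted to break that out into its own step, it might look something like this sketch. Unlike the code above, it targets every img element and skips any image that already declares a loading attribute; both of those choices are assumptions on my part, as are the sample image paths:

```cfml
<cfscript>
	// ASSUMPTION: the jSoup JAR is already on the class path.
	dom = createObject( "java", "org.jsoup.Jsoup" )
		.parse( "<p><img src='/resources/uploads/2022/a.png'><img src='/resources/uploads/2022/b.png' loading='eager'></p>" )
		.body()
	;

	for ( node in dom.select( "img" ) ) {

		// Leave any explicit choice alone; only fill in the missing attribute.
		if ( ! node.hasAttr( "loading" ) ) {

			node.attr( "loading", "lazy" );

		}

	}

	writeOutput( encodeForHtml( dom.html() ) );
</cfscript>
```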

Anyway, a huge shout-out to James Moberg for tirelessly trying to get me to try (or re-try) new things! Now that I have jSoup running in my ColdFusion blogging platform, my brain is already coming up with fun things to do with it!

Short link: https://bennadel.com/4201

Reader Comments

Charlie Arehart Feb 8, 2022 at 5:52 PM

Great stuff, Ben. Jsoup is indeed amazing: good on you and James for reminding folks about it.

For those who may see Ben's use of javaloader here--and maybe have never used the this.javasettings he mentions, note that he shows using that in his 2012 jsoup post.

If it may motivate some to consider that feature (which was added in CF10), I'll note that it would entail just 1 new line and two minor changes to the code below. I started to offer that here, but it made the comment lengthy. You can discern it from his older post, and I just wanted to stress that it does show that. :-)

And yep, as he notes, Lucee offers still another approach where you can name the jar on the createobject call, simplifying things further.

Ben Nadel Feb 8, 2022 at 6:22 PM

@Charlie,

It's a good point. I mostly use JavaLoader because I already have it established in the app—so it's easy to just keep the same pattern going for all Java-related loading.

Charlie Arehart Feb 8, 2022 at 6:52 PM

@Ben,

Oh sure. I totally get that. And I had acknowledged (in the longer comment I opted to truncate) how that was likely your reasoning (or just being used to it). Again, my only reason for commenting was to help those who found the post but were NOT yet using javaloader, who might find that they'd not need to. :-)

And then I commented how there are indeed pros and cons to each...and that's part of what made the comment longer and longer. So I dropped all that, as I knew this wasn't the right place for that discussion.

Same with my dropping the simple code changes needed, figuring folks could find it in that other post. It just wasn't obvious from this post that you DID show its use in the other. :-)

Ben Nadel Feb 8, 2022 at 6:55 PM

@Charlie,

Totes ma'goates!

Ben Nadel Feb 10, 2022 at 11:34 AM

@All,

My brain is starting to think of all the fun and interesting ways I can now use jSoup to enhance my blog experience. The first idea that I had was to inject section-anchors:

www.bennadel.com/blog/4203-using-jsoup-to-inject-section-title-anchors-in-coldfusion-2021.htm

Now, each h2 - h6 title tag will get an # link prepended to it that allows users to jump directly to a given section on the page 💪

Ben Nadel Feb 12, 2022 at 1:28 PM

@All,

In a continuing effort to think about how I can use jSoup to enhance my ColdFusion blogging logic, I'm now using it to locate and extract images for Open Graph / Twitter Card previews:

www.bennadel.com/blog/4206-using-jsoup-to-extract-open-graph-twitter-card-images-in-coldfusion-2021.htm

This basically uses jSoup to parse the HTML content of the post and look for images in my /resources/uploads directory. Then, I can use the src and alt attributes on the selected elements to define <meta> tags for link unfurling in a social media context.
