ColdFusion 10 - Parsing Dirty HTML Into Valid XML Documents

By Ben Nadel

Published 2012-02-28 in ColdFusion — Comments (9)

As I blogged earlier, ColdFusion 10 now supports XPath 2.0 in the xmlSearch() and xmlTransform() functions. This might not sound like a very exciting upgrade; however, when you realize that ColdFusion 10 now enables the parsing of "dirty" HTML code into valid XML documents, suddenly, the world of XML becomes a lot more interesting.

NOTE: At the time of this writing, ColdFusion 10 was in public beta.

ColdFusion 10 doesn't provide a native htmlParse() method; however, ColdFusion 10 now ships with the TagSoup 1.2 library pre-installed. This means that we can now instantiate the TagSoup classes and use them to convert our HTML documents into valid XML documents. And, of course, once we do that, we can use xmlSearch() to easily extract elements from our target HTML source code.

To demonstrate this functionality, I'm going to create "dirty" HTML content and then parse it into a searchable XML document. When I use the term "dirty," I simply mean that the HTML will have things like missing close-tags, missing attribute quotes, poor nesting, and upper-case element and attribute names.

<!---
	Create our "dirty" HTML document. Dirty in the sense that it
	cannot be parsed as valid XML. In order to make this document
	"bad", we'll have tags that don't self-close and perhaps a
	missing close-tag or two.
--->
<cfsavecontent variable="dirtyHtml">

	<!doctype html>
	<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<title>Dana Linn Bailey</title>
		<meta name="description" content="Strong female muscle, FTW!">
		<meta name="keywords" content="female muscle,femmuscle,sexy">
	</head>
	<body>

		<h1>
			Dana Linn Bailey
		</h1>

		<h2>
			Professional Bodybuilder
		</h2>

		<p>
			<IMG
				SRC="//www.danalinn.com/images/photos/DanaLinnBailey_3.jpg"
				ALT="Dana Linn Bailey"
				HEIGHT=250>
			<br>
		</p>

		<h3>
			Professional Services
		</h3>

		<ul>
			<li>Full Contest Preparation
			<li>12-Week Weight Management Program
			<li>ONE-TIME Personalized Diet Plan
			<li>ONE-TIME Personalized Week Training Program
			<li>Train with DLB herself!!!
		</ul>

		<h2>
			Biography
		</h2>

		<p>
			I grew up a jock. At age 6, I was already on the swim
			team, waking up and going to practice just like the big
			kids. Up until high school, I was a 6-sport athlete all
			year round, playing soccer, basketball, field hockey,
			softball, running track and also swim team. In high
			school I continued with my 3 favorite sports, soccer,
			basketball, and field hockey and excelled in all with
			many awards.

		<p>
			<a href=http://www.danalinn.com/about.html>Read More</a>.
		</p>

	</body>
	</html>

</cfsavecontent>


<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->


<cfscript>


	// I take an HTML string and parse it into an XML(XHTML)
	// document. This is returned as a standard ColdFusion XML
	// document.
	function htmlParse( htmlContent, disableNamespaces = true ){

		// Create an instance of the Xalan SAX2DOM class as the
		// recipient of the TagSoup SAX (Simple API for XML) compliant
		// events. TagSoup will parse the HTML and announce events as
		// it encounters various HTML nodes. The SAX2DOM instance will
		// listen for such events and construct a DOM tree in response.
		// NOTE: On some JVM versions, this constructor appears to
		// require a boolean argument; if the no-argument call fails,
		// try .init( true ) instead.
		var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

		// Create our TagSoup parser.
		var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();

		// Check to see if namespaces are going to be disabled in the
		// parser. If so, then they will not be added to elements.
		if (disableNamespaces){

			// Turn off namespaces - they are lame and nobody likes
			// to perform xmlSearch() methods with them in place.
			tagSoupParser.setFeature(
				tagSoupParser.namespacesFeature,
				javaCast( "boolean", false )
			);

		}

		// Set our DOM builder to be the listener for SAX-based
		// parsing events on our HTML.
		tagSoupParser.setContentHandler( saxDomBuilder );

		// Create our content input. The InputSource encapsulates the
		// means by which the content is read.
		var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
			createObject( "java", "java.io.StringReader" ).init( htmlContent )
		);

		// Parse the HTML. This will trigger events which the SAX2DOM
		// builder will translate into a DOM tree.
		tagSoupParser.parse( inputSource );

		// Now that the HTML has been parsed, we have to get a
		// representation that is similar to the XML document that
		// ColdFusion users are used to having. Let's search for the
		// ROOT document and return it.
		return(
			xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
		);

	}


	// ------------------------------------------------------ //
	// ------------------------------------------------------ //


	// Parse the "dirty" HTML into a valid XML document.
	xhtml = htmlParse( dirtyHtml );

	// Query for the head contents.
	headContents = xmlSearch( xhtml, "/html/head/*" );

	// Query for the body contents.
	bodyContents = xmlSearch( xhtml, "/html/body/*" );

	// Output the two values.
	writeDump( headContents );
	writeDump( bodyContents );


</cfscript>

As you can see, the HTML code is pretty sloppy. And still, we take our HTML document, run it through htmlParse(), and then search the resultant XML document for various elements. When we run the above code, we get the following page output:

[Screenshot: Parsing HTML code into XML documents using ColdFusion 10 and TagSoup.]

As you can see, the dirty HTML was successfully parsed into a valid ColdFusion XML document, which we were able to search with XPath 2.0 and xmlSearch(). The TagSoup library was able to convert our element and attribute names to lowercase, handle tags that don't require closing (e.g., BR and IMG), and close tags that were improperly left open.
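
Once the document is parsed, pulling out individual values is just more XPath. The following is a minimal sketch, assuming the xhtml variable from the demo above (parsed with namespaces disabled). The variable names are my own, and the exact results depend on how TagSoup repaired the markup, so treat it as illustrative rather than definitive.

```cfml
<cfscript>

	// Assumes "xhtml" is the document returned by htmlParse()
	// in the demo above, parsed with namespaces disabled.

	// Gather the text of each list item. The LI tags were never
	// closed in the source HTML; TagSoup should have closed them.
	listItems = xmlSearch( xhtml, "//ul/li" );

	services = [];

	for (item in listItems){

		arrayAppend( services, trim( item.xmlText ) );

	}

	// Gather the image source. The attribute names are lowercased
	// during parsing, so we search for "src", not "SRC".
	imageSources = xmlSearch( xhtml, "//img/@src" );

	// Output the two values.
	writeDump( services );
	writeDump( imageSources );

</cfscript>
```

If namespaces were left enabled (disableNamespaces = false), these simple paths would likely not match, and the expressions would need something like local-name() tests instead.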

The TagSoup library, on its own, is nothing new. I tried playing around with it a few years ago, loading it into the ColdFusion context with a Groovy class loader. The difference here is that TagSoup now ships with ColdFusion 10. Of course, now that ColdFusion 10 allows per-application Java Class loading, this becomes much less of an issue. But still, pretty cool!
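
For reference, the per-application Java class loading mentioned above is configured through the javaSettings property in Application.cfc. This is only a rough sketch: the application name and the "lib" folder are hypothetical, and how the relative path resolves is an assumption to verify on your own server.

```cfml
// Application.cfc
component {

	// Hypothetical application name.
	this.name = "TagSoupDemo";

	// Ask ColdFusion to load any JAR / class files found in the
	// given paths for this application only. The "./lib" folder
	// is a made-up location, assumed here to sit next to this
	// Application.cfc.
	this.javaSettings = {
		loadPaths = [ "./lib" ],
		loadColdFusionClassPath = true,
		reloadOnChange = false
	};

}
```

With something like this in place, createObject( "java", ... ) should be able to see classes from JAR files dropped into that folder, which is what makes bundling your own copy of a library such as TagSoup less of a hassle.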

Reader Comments

Ben Nadel Feb 28, 2012 at 10:18 AM

@All,

ColdFusion 10 also appears to ship with the NekoHTML parser as well:

http://nekohtml.sourceforge.net/

However, from some brief experimentation, I was getting better results with less effort from the TagSoup parser.

Brian Swartzfager Feb 28, 2012 at 10:19 AM

Very cool! I love the idea of being able to reliably parse HTML pages into XML data.

Marko Feb 28, 2012 at 2:51 PM

Really Cool! That's really useful for all your scraping needs!

WebManWalking Feb 28, 2012 at 7:07 PM

@Ben,

This may give XHTML the biggest boost it's ever gotten! Rapid Application Development + cleanup = actual use!

Nelle Feb 29, 2012 at 6:15 AM

@Ben,

"Of course, now that ColdFusion 10 allows per-application Java Class loading"

What is this new witchcraft you mention ?

Peter Pham Feb 29, 2012 at 6:16 AM

Very cool. However, I have been using Railo for a while, and it also comes with an htmlParse() function that converts a string to an XML Doc.

Anyway, it is a bit annoying that when you have an XML Doc, you cannot really convert it back to the original HTML string perfectly with toString(xml), because toString() adds XML indentation and line breaks.

Rob Dudley Mar 5, 2012 at 3:33 PM

Hi Ben,

I know all the buzz is about CF10 right now but readers may be interested to know that your example works under CF9 using JavaLoader to load TagSoup.

I'm getting errors on larger documents but will need to test with CF10 to see if that's version specific or just an issue with the library.

Great post!

Rob

Mike Oct 12, 2012 at 6:24 PM

Ben I'm getting this error from your code.

Unable to find a constructor for class com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM that accepts parameters of type ( '' ).

I'm guessing the path in

var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

is incorrect for my server. How do I find out what it should be? Or what is my issue? Thanks

York Jun 13, 2013 at 9:32 PM

Hi Ben - your code works in CF10 Ent with a slight update to the line:
var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

Edit this to read ...init(true);

What I found with TagSoup is it fails when the DOM element has an ID tag.
For example <table id="mytable"> will result in an error.

Here is also a good article on comparisons:
http://www.benmccann.com/blog/java-html-parsing-library-comparison/

Thanks