Parsing HTML Natively With htmlParse() In Lucee 5.3.2.77

By Ben Nadel

Published 2019-07-07 in ColdFusion — Comments (7)

Parsing HTML isn't a task that I often have to perform during the "normal operation" of a ColdFusion application. However, parsing HTML can be helpful when it comes to data migration; for example, when migrating from an old, HTML-based content management system (CMS) to a Markdown-based one. And, as someone who recently started using Flexmark and Markdown to author blog posts, this might be a migration that I attempt to take on. In the past, I would have reached for TagSoup or jSoup to perform such parsing. But, it turns out that Lucee 5.3 now provides HTML parsing natively with the htmlParse() function. This function accepts a string and returns an XML document.

The htmlParse() function is simple. You pass it a String; it returns an XML document. You can then traverse the returned XML data structure manually; or, you can use XPath to query for target elements. The complication with the XPath option is that the htmlParse() function applies a namespace to the document, which means that instead of using simple queries like:

xmlDoc.search( "//p" )

... you have to use somewhat janky queries like this:

xmlDoc.search( "//*[ local-name() = 'p' ]" )

The degree to which this bothers you is strictly personal. That said, to make this exploration a bit more interesting, I wanted to create a wrapper function for htmlParse() that uses XSLT (XSL Transformations) to remove namespaces from the resultant data structure.
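To make the namespace issue concrete, here is a minimal sketch that runs both styles of query against the same parsed document. The sample markup and the variable names are illustrative assumptions, not part of the original demo. Based on the behavior described above, the bare-name query is expected to come back empty while the local-name() query finds the paragraphs.

```cfml
<cfscript>

	// Illustrative markup only; any HTML string would do here.
	sampleDoc = htmlParse( "<body><p>Hello</p><p>World</p></body>" );

	// Matching on the local name ignores whatever namespace htmlParse() applied.
	byLocalName = sampleDoc.search( "//*[ local-name() = 'p' ]" );

	// A bare element name in XPath 1.0 only matches elements in NO namespace.
	byBareName = sampleDoc.search( "//p" );

	dump( label = "local-name() matches", var = byLocalName.len() );
	dump( label = "bare-name matches", var = byBareName.len() );

</cfscript>
```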

I created a function called htmlParseNoNamespaces(). To see it in action, I'm going to parse a simple (but invalid) HTML string and then query for the paragraphs:

<!---
	CAUTION: The htmlParse() function is very forgiving with invalid markup. But, it
	seems to work most consistently when there is a single ROOT NODE in the given markup.
	For example, in the following markup, if I remove the "body" wrapper, the structure
	of the parsed document completely changes, placing one SECTION element inside the
	other SECTION element (presumably because I am not closing the P-tag).
--->
<cfsavecontent variable="markup">

	<body class=dark-mode>
		<section>
			<!-- Testing some cool things. -->
			<p id=intro class='content'>This is very interesting!
		</section>
		<section>
			<p>I agree, this is <u class="em">player!</u>
			<p>But, will it <strong>work</strong>!</p>
		</section>
		<p>One wonders.
	</body>

</cfsavecontent>
<cfscript>

	doc = htmlParseNoNamespaces( markup );

	// Gather all of the text from the P-elements. Since the parsed HTML document is
	// returned as an XML document, we can use XPath to locate the paragraphs.
	paragraphs = doc
		.search( "//p" )
		.map(
			( node ) => {

				// For each paragraph, aggregate the string value of the entire node,
				// which will include all of the descendant nodes as well.
				return( node.search( "normalize-space( string( . ) )" ) );

			}
		)
	;

	dump( label = "P-Text", var = paragraphs );
	echo( "<br />" );
	dump( label = "HtmlDoc", var = doc.search( "//body" )[ 1 ] );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* When the native htmlParse() function runs, it includes XML name-spaces which make
	* it much harder to search the subsequent document using XPath. This method strips
	* those XML name-spaces from the parsed document, allowing XPath to target node-names
	* more directly.
	* 
	* @htmlMarkup I am the HTML string being parsed.
	* @output false
	*/
	public any function htmlParseNoNamespaces( required string htmlMarkup ) {

		// To strip out the name-spaces, we're going to use XSLT (XML Transforms). The
		// following XSLT document will traverse the parsed HTML document and copy nodes
		// over to a new output string using only the node names.
		// --
		// Read More:  https://www.ibm.com/support/knowledgecenter/en/ssw_ibm_i_74/rzasp/rzaspxml4369.htm
		var removeNamespacesXSLT = trim('
			<xsl:stylesheet
				version="1.0"
				xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

				<xsl:output
					method="xml"
					version="1.0"
					encoding="UTF-8"
					indent="yes"
				/>

			 	<!-- Keep comment nodes. -->
			 	<xsl:template match="comment()">
					<xsl:copy>
						<xsl:apply-templates />
					</xsl:copy>
				</xsl:template>

				<!-- Keep element nodes. -->
				<xsl:template match="*">
					<!-- Remove element prefix. -->
					<xsl:element name="{ local-name() }">
						<!-- Process attributes. -->
						<xsl:for-each select="@*">
							<!-- Remove attribute prefix. -->
							<xsl:attribute name="{ local-name() }">
								<xsl:value-of select="." />
							</xsl:attribute>
						</xsl:for-each>

						<!-- Copy child nodes. -->
						<xsl:apply-templates />
					</xsl:element>
				</xsl:template>

			</xsl:stylesheet>
		');

		// In order to remove the name-spaces, we have to parse the document twice -
		// once to parse the HTML into an XML document. Then, once again to parse the
		// transformed XML string (less the name-spaces) back into an actual XML
		// document that we can search using XPath.
		return( xmlParse( htmlParse( htmlMarkup ).transform( removeNamespacesXSLT ) ) );

	}

</cfscript>

As you can see, I'm using the htmlParse() function to parse the incoming HTML string. Then, I use the .transform() method to remove the namespaces (using the given XSLT document), which produces a serialized XML string. I then parse that XML string back into an XML Document which can be searched easily with XPath and simple Element selectors.

Now, when I run this Lucee CFML document, I get the following output:

[Screenshot: HTML parsed with Lucee 5.3 can be queried with XPath.]

As you can see, once the namespace-free XML document is produced, I can use a simple XPath query like //p to search for the Paragraph nodes. I then use a .map() method to map the P-nodes onto String values.
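For reference, the three steps that happen inside htmlParseNoNamespaces() can be pulled apart as in the following sketch. It uses a compacted variant of the stylesheet above (it drops the comment template) and a made-up sample string; the variable names are assumptions for illustration only.

```cfml
<cfscript>

	// Hypothetical sample input.
	sampleMarkup = "<body><p>Hello <strong>world</strong></p></body>";

	// Compact variant of the namespace-stripping stylesheet (comments are dropped).
	stripXslt = trim('
		<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
			<xsl:output method="xml" encoding="UTF-8" />
			<xsl:template match="*">
				<xsl:element name="{ local-name() }">
					<xsl:for-each select="@*">
						<xsl:attribute name="{ local-name() }">
							<xsl:value-of select="." />
						</xsl:attribute>
					</xsl:for-each>
					<xsl:apply-templates />
				</xsl:element>
			</xsl:template>
		</xsl:stylesheet>
	');

	// Step 1: Parse the HTML into a (namespaced) XML document.
	namespacedDoc = htmlParse( sampleMarkup );

	// Step 2: Apply the XSLT, which yields a serialized XML STRING, not a document.
	cleanXmlString = namespacedDoc.transform( stripXslt );

	// Step 3: Parse that string back into an XML document for plain XPath queries.
	cleanDoc = xmlParse( cleanXmlString );

	dump( label = "Is simple value", var = isSimpleValue( cleanXmlString ) );
	dump( label = "P count", var = cleanDoc.search( "//p" ).len() );

</cfscript>
```

Keeping the intermediate string in its own variable also makes it easy to dump the serialized XML when the final XPath queries don't match what you expect.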

Again, parsing HTML isn't something that I have to do very often in my ColdFusion code. That said, it's really cool that parsing HTML is now a native feature of Lucee 5.3.2.77. It will make the few times that I do need the functionality all that much easier.

Short link: https://bennadel.com/3650

Reader Comments

Charles Robertson Jul 7, 2019 at 1:46 PM

This is amazing. I actually never got on too well with jSoup, so I am looking forward to using:

htmlParse()

And, I actually like using 'XPath'. It's also great that it still parses everything, even when some of the paragraph tags are not closed.
Impressive stuff...

Ben Nadel Jul 8, 2019 at 9:45 AM

@Charles,

Pretty exciting, right? One of the things that I would like to be able to do with it eventually is extract fenced-code-blocks from my markdown content. I explored this idea this morning:

www.bennadel.com/blog/3651-dynamically-loading-java-classes-from-jar-files-using-createobject-in-lucee-5-3-2-77.htm

Having to use XPath is not great when compared with something like CSS-based selectors. But, for small things, like my aforementioned exploration, it's a perfectly enjoyable option.

Christian Nov 21, 2019 at 1:37 PM

Hi Ben,

I read this article with great interest and it brings me to a thought about a problem that has been bothering me for a long time.

There is a table. The rows (first column) are attributes of something; the columns (first row) are numbers / ratings. The position of an X in the table determines the rating of an attribute.

Is it possible to extract the information out of the table like this?

Attribute A - 2
Attribute B - 3
Attribute C - 1
Attribute D - 2

Ben Nadel Jan 8, 2020 at 7:08 AM

@Christian,

Sorry for the late reply. But, I assume (without knowing the exact data model here), that you could .map() the rows onto the given structure you are looking for. So, some pseudo-code for this might look like:

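// NOTE: Using the namespace-stripping wrapper from this post so that simple XPath
// queries like "//tr" can match.
var data = htmlParseNoNamespaces( your_html ).search( "//tr" ).map(
	( row ) => {
	
		// Find the values in each row that you are looking for. Since .search()
		// returns an array of matching nodes, grab the first match.
		var attribute = row.search( "./....." )[ 1 ].xmlText;
		var rating = row.search( "./....." )[ 1 ].xmlText;
		
		return({
			attribute: attribute,
			rating: rating
		});
	
	}
);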

I don't know what those internal .search() values would actually look like since I don't know what kind of HTML you are working with. But, essentially, each .map() iteration would give you access to the parsed TR row. Then, you can perform a .search() off of the TR context to get your specific values (and maybe apply some different logic to translate locations to ratings).

Hopefully that helps a little bit.

Ben Nadel Mar 18, 2020 at 7:42 AM

@All,

After working extensively with htmlParse() over the weekend, I stumbled upon something interesting - escaped HTML entities are no longer escaped when I read them out through the .xmlText property of the resultant XML document:

www.bennadel.com/blog/3789-reading-xmltext-values-from-the-xml-document-produced-by-htmlparse-in-lucee-cfml-5-3-4-80.htm

... I had to get around this by using the toString() function on the given TEXT_NODE and then stripping off the XML DOCTYPE.

Brad Wood Mar 16, 2022 at 5:17 PM

Excellent information. I just ran into this myself thinking, "Hey, I'll use HTMLParse() and XMLSearch() to do some quick HTML work", but then I was surprised when my xpaths didn't return any results.

This would be a nice feature in Lucee to have an option to remove the namespaces for you. Oooh, look-- there's already a ticket!
https://luceeserver.atlassian.net/browse/LDEV-839
I'll put a link to this blog post there.

Brad Wood Mar 16, 2022 at 5:59 PM

Whoa, look at this weird behavior I found while trying to re-parse the cleaned XML. It appears that when you use the XMLParse() BIF in Lucee to parse an XML document that has a root element of html and a nested element of head, Lucee will ADD an invalid meta tag (not self-closing) to the XML that wasn't there. This of course, will throw an error if you convert the XML to a string and then try to re-parse it!

CFSCRIPT-REPL: toString( XMLParse( '<html><head/></html>' ) )
<html>
    <head>
        <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
    </head>
</html>

Note, I'm not using HTMLParse() in that example, so I wouldn't expect Lucee to do ANYTHING related to HTML. When I cfdump the parsed XML, the meta tag isn't there, so it seems to get added as part of the toString() from what I can tell.

I have added a ticket here:
https://luceeserver.atlassian.net/browse/LDEV-3913
