Avoiding Self-Closing IFRAME Tags Using htmlParse() In Lucee CFML 5.3.4.80
Over the past week, I've been working to retrofit Markdown onto all of my old blog content using Lucee CFML. It's been an exciting journey with a lot of trial and error. For example, the other day, I realized the .xmlText property wasn't giving me escaped HTML entities; and, just this morning, I realized that iframe tags with no content were getting re-serialized as self-closing tags. While this is valid for XML - any tag with no children can be self-closing - only certain tags in HTML can be self-closing. And, the iframe is not one of them. As such, I had to re-process all of my posts, ensuring that iframe tags were serialized using both an Open and Close tag in Lucee CFML 5.3.4.80.
To see the issue I was running into, let's look at a stand-alone example. In the following ColdFusion code, we're going to parse an HTML snippet using htmlParse(). And then, simply serialize it back to HTML using toString():
<cfscript>
```
<cfsavecontent variable="htmlContent">
<p>
Heck, checkout this video:
</p>
<p>
<iframe src="video.mp4"></iframe>
</p>
</cfsavecontent>
```
// The htmlParse() function parses the HTML into an XML document. The rules for XML
// documents are different than the rules for HTML documents. This can cause a
// re-serialization problem for non-self-closing tags with empty-content.
xmlContent = htmlParse( htmlContent );
// Because the IFRAME element has no child-nodes, stringification of the XML document
// will render the IFRAME as a SELF-CLOSING tag. This is valid for XML but is NOT
// valid for HTML.
echo( encodeForHtml( toString( xmlContent.html.body ) ) );
</cfscript>
As you can see, the HTML content being parsed contains an iframe tag with no children:
<iframe src="video.mp4"></iframe>
And, when we serialize this using toString(), we get the following markup (I've manually added white-space to make it more readable):
<?xml version="1.0" encoding="UTF-8"?>
<body xmlns="http://www.w3.org/1999/xhtml">
<p>
Heck, checkout this video:
</p>
<p>
<iframe frameborder="1" scrolling="auto" src="video.mp4"/>
</p>
</body>
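As you can see, the iframe tag is being serialized as a self-closing tag, in that it now ends with /> rather than with </iframe>. If I were to try and get the browser to render this iframe, the page would break. It wouldn't throw an error; it would simply hit the <iframe/> tag and stop rendering the rest of the page output. This is because an HTML parser ignores the trailing slash on a non-void tag, treats it as an unclosed iframe, and swallows everything after it as the iframe's content.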
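Since I had to re-process all of my posts, it helps to know which ones are affected. Here is a minimal sketch of how one might check for the problem, assuming the same round-trip as the demo above. The variable names (serialized, hasSelfClosingIframe) and the regular expression are my own illustration and only look for an iframe tag that ends with "/>":

```cfml
<cfscript>
	// Sample input. In practice, this would be the stored content of a post.
	htmlContent = '<p><iframe src="video.mp4"></iframe></p>';

	// Round-trip the content through htmlParse() and toString(), as in the demo.
	xmlContent = htmlParse( htmlContent );
	serialized = toString( xmlContent.html.body );

	// Look for an IFRAME tag that ends with "/>" instead of a separate closing
	// tag. The pattern is an assumption: it does not account for a ">" character
	// inside an attribute value.
	hasSelfClosingIframe = ( reFindNoCase( "<iframe\b[^>]*/>", serialized ) > 0 );

	echo( hasSelfClosingIframe ? "Needs re-processing" : "Looks fine" );
</cfscript>
```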
NOTE: Literally, as I am writing this, I just noticed that the htmlParse() method seems to have injected frameborder and scrolling attributes into my iframe tag.
To get around this, I have to force the iframe tag to have at least one child-node. If it has one child node, then the toString() call will correctly render it with the </iframe> closing tag.
The easiest way I can think of to do this is to simply append an empty HTML comment to the iframe content. This shouldn't have any bearing on the visual rendering of the page; but, it will force the iframe tree-fragment to be non-empty. I'm going to do this before I run the HTML content through htmlParse():
<cfscript>
```
<cfsavecontent variable="htmlContent">
<p>
Heck, checkout this video:
</p>
<p>
<iframe src="video.mp4"></iframe>
</p>
</cfsavecontent>
```
// In order to get IFRAME tags to re-serialize with the desired, two-tag format, we
// have to ensure that the IFRAME contains at least one child-node. In this case, we
// can use the innocuous COMMENT node to force children.
htmlContent = htmlContent
.reReplaceNoCase( "></iframe>", "><!-- --></iframe>", "all" )
;
// With the inserted COMMENT, our IFRAME element in the resultant XML document will
// no longer be empty.
xmlContent = htmlParse( htmlContent );
// ... which means, when re-serialized, it will render as <iframe>....</iframe>.
echo( encodeForHtml( toString( xmlContent.html.body ) ) );
</cfscript>
As you can see, before I call htmlParse(), I'm finding any iframe closing tag that butts up against another tag artifact (angle bracket) and I'm inserting an empty HTML comment. Now, when we re-serialize the content using the toString() function, we get the following markup (again, I've manually added white-space to make it more readable):
<?xml version="1.0" encoding="UTF-8"?>
<body xmlns="http://www.w3.org/1999/xhtml">
<p>
Heck, checkout this video:
</p>
<p>
<iframe frameborder="1" scrolling="auto" src="video.mp4"><!-- --></iframe>
</p>
</body>
As you can see, because we force the iframe tag to have at least one child node, it now gets re-serialized with the </iframe> closing tag.
To be clear, I'm talking about the iframe tag in this case because that's the tag that caused my page-rendering issues. However, this same rule applies to any HTML tag that has no children. Of course, tags like img and meta are allowed to be self-closing and won't be a problem. It just happens that the iframe tag will break the page if it is self-closing.
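If other empty tags turn out to be a problem in your content, the same comment trick can be applied to a list of tag names. The following is only a sketch: the tagNames list is hypothetical (it is not something I needed for my own posts), and the pattern assumes that attribute values never contain a ">" character:

```cfml
<cfscript>
	htmlContent = '<p><iframe src="video.mp4"></iframe> <span class="icon"></span></p>';

	// Hypothetical list of non-void tags to protect. Adjust it to your content.
	tagNames = "iframe|span|video|canvas";

	// Match an opening tag that is immediately followed by its own closing tag
	// (\2 is the captured tag name) and insert an empty comment between them.
	htmlContent = htmlContent.reReplaceNoCase(
		"(<(#tagNames#)\b[^>]*>)(</\2>)",
		"\1<!-- -->\3",
		"all"
	);

	xmlContent = htmlParse( htmlContent );
	echo( encodeForHtml( toString( xmlContent.html.body ) ) );
</cfscript>
```

I would leave tags like script, style, textarea, and title out of such a list. Inside those tags, the browser treats the comment markup as raw text rather than as a comment, so the trick is not innocuous there.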
Ultimately, there may be other ways to deal with HTML parsing and sanitization, such as by using XSLT and xmlTransform() in ColdFusion. However, htmlParse() feels like a nice combination of ease-of-use and powerful functionality. It just happens that it has caveats that you have to watch out for in Lucee CFML.
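As one example of another way, the self-closing tag could be repaired after serialization instead of before parsing. This is not the approach I took above; it's a rough sketch that assumes the serialized output looks like the markup shown earlier and that no attribute value contains a ">" character:

```cfml
<cfscript>
	htmlContent = '<p><iframe src="video.mp4"></iframe></p>';

	xmlContent = htmlParse( htmlContent );
	serialized = toString( xmlContent.html.body );

	// Expand any self-closing IFRAME back into an open/close pair, keeping
	// whatever attributes were captured (\1).
	serialized = serialized.reReplaceNoCase(
		"<iframe\b([^>]*)/>",
		"<iframe\1></iframe>",
		"all"
	);

	echo( encodeForHtml( serialized ) );
</cfscript>
```

The trade-off is that this runs a regular expression against serialized XML, whereas the comment approach lets the serializer produce the closing tag on its own.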
Reader Comments
Hi, I had the same problem. Self-closing iframes are invalid HTML. I wonder if there is a bug report anywhere.
I found this solution also very helpful: https://stackoverflow.com/q/41890415/1337474
Thanks for sharing!
@Johannes,
I don't know about any bug reports. The core of the problem is that Lucee is parsing HTML into XML, and XML just has different rules about what is and is not valid. Lately, I've been using jSoup to do the HTML parsing. Yes, it's a 3rd-party library that I now have to pull in; but, it was designed to parse and render HTML specifically, so it knows how to do the right thing.
I was actually just using it the other day to fix some names before rendering some stored content. You might just be curious to see it in action:
www.bennadel.com/blog/4315-using-jsoup-to-fix-post-marriage-name-changes-in-coldfusion-2021.htm