Maintaining White Space Using jSoup And ColdFusion
jSoup is a Java library for parsing and manipulating HTML strings. For the last few years, I've been using jSoup to clean up and normalize my blog posts. And now, I'm looking to use jSoup to help me transform and cache GitHub Gists. At the time of this writing, Gist code is rendered in an HTML <table> with cells that use white-space: pre as the means of controlling white space output. jSoup doesn't parse the CSS; so, it doesn't understand that it needs to maintain this white space when serializing the document back into HTML. If we want to keep this white space in the resultant document, we have to disable pretty printing.
ASIDE: jSoup will naturally maintain white space that is contained within a <pre> tag. However, that doesn't apply to elements using white-space: pre CSS properties.
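A minimal sketch of that difference, assuming the same jsoup-1.16.1.jar path used in the demo later in this post (the input markup is hypothetical, made up for illustration). With pretty printing left at its default, I'd expect the padding inside the <pre> tag to survive serialization while the padding inside the CSS-styled <div> gets normalized:
<cfscript>
	// Hypothetical input: the same padded text, once inside a PRE tag and once inside
	// a DIV that only uses CSS to request white space preservation.
	input = '<pre>  padded  text  </pre><div style="white-space: pre">  padded  text  </div>';

	// ASSUMPTION: the jSoup JAR file sits next to this template.
	document = createObject( "java", "org.jsoup.Jsoup", [ expandPath( "./jsoup-1.16.1.jar" ) ] )
		.parseBodyFragment( input )
	;

	// Serialize with the default (pretty print enabled) output settings and show the
	// escaped markup so that the white space handling of each element can be compared.
	writeOutput( "<pre>" & encodeForHtml( document.body().html() ) & "</pre>" );
</cfscript>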
The pretty print settings control how white space is handled within the .html() and .text() methods. These methods can be used to access parts of the jSoup Document Object Model (DOM), and they are also used internally during the serialization process.
The pretty print settings are defined at the Document level and can be accessed at:
document.outputSettings()
This object provides a getter / setter for the pretty printing:
outputSettings.prettyPrint( [ boolean ] )
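As a rough sketch of that getter / setter pair, assuming the same JAR path as the demo in this post: calling prettyPrint() with no arguments reads the current flag, and calling it with a boolean updates the flag and returns the settings object so that calls can be chained.
<cfscript>
	// ASSUMPTION: the jSoup JAR file sits next to this template.
	document = createObject( "java", "org.jsoup.Jsoup", [ expandPath( "./jsoup-1.16.1.jar" ) ] )
		.parseBodyFragment( "<p> Some content </p>" )
	;
	outputSettings = document.outputSettings();

	// Getter: read the current pretty print flag.
	writeOutput( "Before: " & outputSettings.prettyPrint() & "<br />" );

	// Setter: update the flag on the document's settings.
	outputSettings.prettyPrint( false );

	// Getter again: read the flag back after the change.
	writeOutput( "After: " & outputSettings.prettyPrint() & "<br />" );
</cfscript>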
In order to disable pretty printing and maintain the original white space, we have to invoke this method with (false) before we serialize our document. To see this in action, I'm going to parse a Paragraph tag that contains leading and trailing white space. Then, I'll serialize the resultant document: once with pretty printing and then once after pretty printing has been disabled:
<cfscript>
	// Note that our inner content is surrounded by leading / trailing spaces.
	input = "<p> Some content with spaces </p>";

	document = javaNew( "org.jsoup.Jsoup" )
		.parseBodyFragment( input )
	;

	// Let's update the document content (to demonstrate that we have reason to parse and
	// then re-serialize the content).
	document.selectFirst( "p" )
		.attr( "data-edited", "true" )
	;

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	// By default, pretty printing is enabled within the document. This means, when we go
	// to serialize the document as HTML, it will normalize all the text. Which means,
	// any "unnecessary" leading / trailing spaces will be trimmed.
	writeOutput( "<h2> Pretty Print Enabled </h2>" );
	renderDocumentAsPre( document );

	// When we disable pretty printing, jSoup will leave all the text nodes AS IS, even if
	// they aren't strictly necessary.
	document.outputSettings()
		.prettyPrint( false )
	;

	writeOutput( "<h2> Pretty Print Disabled </h2>" );
	renderDocumentAsPre( document );

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I render the given jSoup document as an escaped markup within PRE tags. The markup
	* is written directly to the output; nothing is returned.
	*/
	public void function renderDocumentAsPre( required any document ) {

		writeOutput(
			"<pre>" &
			encodeForHtml( document.body().html() ) &
			"</pre>"
		);

	}

	/**
	* I create a new Java class wrapper using the jSoup JAR files.
	*/
	public any function javaNew( required string className ) {

		var jarPaths = [
			expandPath( "./jsoup-1.16.1.jar" )
		];

		return( createObject( "java", className, jarPaths ) );

	}
</cfscript>
Essentially, this ColdFusion code is taking the jSoup DOM and calling .html() on it in order to serialize the DOM back into an HTML string. It's doing this twice, once before and once after the pretty printing has been disabled. And, when we run this ColdFusion code, we get the following output:
As you can see, the first serialization of the jSoup DOM resulted in stripped-out white space. However, after we disabled pretty printing, the second serialization of the jSoup DOM leaves our white space intact.
Reader Comments
Here's a PR from the jSoup repository that examines the white-space expectation depending on the pretty-print setting. This is where I found out about this setting.
As a fast-follow, here's another post in which I'm using jSoup to parse and transform GitHub Gist data into a consumable data structure. This is where the .prettyPrint(false) comes into play: www.bennadel.com/blog/4464-parsing-github-gist-embeds-into-a-normalized-data-structure-using-jsoup-in-coldfusion.htm