Generating Pandoc Heading Identifiers In ColdFusion

By Ben Nadel

Published 2023-12-06 in ColdFusion

Over on my Feature Flags book website, I'm using my book's Markdown content to generate the HTML for the page. I then use jSoup to inject a table of contents (TOC), which requires that I insert an identifier into each heading element. And, now that I'm trying to use Pandoc to generate an EPUB (digital book) version, I need to make sure that my ColdFusion-based heading identifiers match the ones that Pandoc will generate in the final EPUB.

The Pandoc documentation on "Headings and Sections" describes the algorithm that it uses to generate the heading identifiers:

Remove all formatting, links, etc.
Remove all footnotes.
Remove all non-alphanumeric characters, except underscores, hyphens, and periods.
Replace all spaces and newlines with hyphens.
Convert all alphabetic characters to lowercase.
Remove everything up to the first letter (identifiers may not begin with a number or punctuation mark).
If nothing is left after this, use the identifier "section".

The Pandoc documentation also provides a set of sample headings and the identifiers that it will generate. We can use these samples to test our ColdFusion algorithm. And, of course, we'll make ample use of Regular Expressions to solve this problem.

In the following ColdFusion code, we're looping over the samples provided by Pandoc and asserting that our ColdFusion-generated identifier matches the expected identifier:

<cfscript>

	// These values are provided in the Pandoc documentation on Headings and Sections.
	assertions = [
		{
			heading: "Heading identifiers in HTML",
			identifier: "heading-identifiers-in-html"
		},
		{
			heading: "Maître d'hôtel",
			identifier: "maître-dhôtel"
		},
		{
			heading: "*Dogs*?--in *my* house?",
			identifier: "dogs--in-my-house"
		},
		{
			heading: "[HTML], [S5], or [RTF]?",
			identifier: "html-s5-or-rtf"
		},
		{
			heading: "3. Applications",
			identifier: "applications"
		},
		{
			heading: "33",
			identifier: "section"
		}
	];

	// Let's test the Pandoc header assertions against our ColdFusion algorithm, yay!
	for ( assertion in assertions ) {

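		identifier = generateIdentifier( assertion.heading );
		// NOTE: The == operator performs a case-insensitive string comparison in
		// ColdFusion. Since lowercasing is part of the algorithm, we use compare() to
		// make sure that the casing matches as well.
		isMatch = ( compare( assertion.identifier, identifier ) == 0 );

		writeOutput("
			<p>
				Heading: #encodeForHtml( assertion.heading )# <br />
				Expected: #encodeForHtml( assertion.identifier )# <br />
				Received: #encodeForHtml( identifier )# <br />
				Pass: <b>#yesNoFormat( isMatch )#</b>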
			</p>
		");

	}

	// ------------------------------------------------------------------------------- //
	// ------------------------------------------------------------------------------- //

	/**
	* I generate a Pandoc section identifier (ie, URL anchor) from the given heading text.
	* 
	* ASSUMPTION: For this demo, I am assuming that all formatting, links, and footnotes
	* have already been removed and that we are dealing with plain-text header values.
	*/
	public string function generateIdentifier( required string heading ) {

		var identifier = heading
			.trim()
			// Convert all alphabetic characters to lowercase.
			.lcase()

			// Replace all spaces and newlines with hyphens.
			.reReplace( "\s+", "-", "all" )

			// Remove all non-alphanumeric characters, except underscores, hyphens,
			// and periods.
			.reReplace( "[^\w.-]+", "", "all" )

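			// Remove everything up to the first letter (identifiers may not begin with
			// a number or punctuation mark). CAUTION: [a-z] only covers ASCII letters,
			// so a heading that starts with an accented letter would lose that letter.
			.reReplace( "^[^a-z]+", "" )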
		;

		// If nothing is left after this, use the identifier section.
		if ( ! identifier.len() ) {

			return( "section" );

		}

		return( identifier );

	}

</cfscript>

As a general rule, when using Regular Expressions to solve a problem, move the "convert to lowercase" step as high up in the algorithm as you can. That way, you can simplify your patterns by using [a-z] instead of [a-zA-Z], and you can use .reReplace() instead of .reReplaceNoCase(), which should be a little more efficient since it avoids case-insensitive matching.
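To make that trade-off concrete, here is a minimal sketch that strips the leading non-letters from a single sample heading in three ways. The variable names (lowerFirst, lowerLast, noCase) are hypothetical and exist only for this illustration; all three approaches should arrive at the same value, but the first one gets there with the simplest pattern:

```cfml
<cfscript>

	// Sample heading taken from the Pandoc assertions above.
	heading = "3. Applications";

	// Lowercase FIRST: the pattern only has to account for [a-z].
	lowerFirst = heading
		.lcase()
		.reReplace( "^[^a-z]+", "" )
	;

	// Lowercase LAST: the pattern has to account for both cases.
	lowerLast = heading
		.reReplace( "^[^a-zA-Z]+", "" )
		.lcase()
	;

	// Lowercase LAST: alternatively, lean on the case-insensitive function.
	noCase = heading
		.reReplaceNoCase( "^[^a-z]+", "" )
		.lcase()
	;

	writeOutput( "#lowerFirst# / #lowerLast# / #noCase#" );

</cfscript>
```

In a one-step example like this, the difference is small. It becomes more noticeable as the number of patterns in the chain grows, since every pattern after the .lcase() call gets to ignore uppercase letters.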

In this ColdFusion code, I've used Pandoc's description of each step as a comment in the code so that you can see how each RegEx pattern maps to Pandoc's intended outcome. If Regular Expressions seem like a foreign language to you, check out my video presentation on basic pattern usage. Once you start using patterns, you'll find that they improve the quality of your developer life.

With that said, if we run this ColdFusion code, we get the following output:

[Screenshot: output of the header identifier assertions, showing that ColdFusion generated the correct values.]

As you can see, the heading identifiers generated by our ColdFusion Regular Expression replacements match the identifier assertions provided by Pandoc. At this point, I can update my Feature Flags site logic and not worry about the inter-chapter links breaking when I generate my EPUB.
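For context, here is a rough sketch of how the generated identifiers might be injected into heading elements using jSoup. This is not necessarily how the Feature Flags site does it. It assumes that the jSoup JAR is available on the ColdFusion class path and that the generateIdentifier() function from this post is in scope; the html value is a hypothetical stand-in for the Flexmark output:

```cfml
<cfscript>

	// ASSUMPTION: jSoup is on the class path and generateIdentifier() (defined
	// earlier in this post) is in scope. The HTML below is a hypothetical sample.
	html = "<h2>Heading <em>identifiers</em> in HTML</h2><p>Chapter content.</p>";

	dom = createObject( "java", "org.jsoup.Jsoup" )
		.parseBodyFragment( html )
	;

	for ( node in dom.select( "h1, h2, h3, h4, h5, h6" ) ) {

		// The .text() method returns the plain-text content of the heading, which
		// lines up with Pandoc's first step (remove all formatting, links, etc).
		node.attr( "id", generateIdentifier( node.text() ) );

	}

	writeOutput( encodeForHtml( dom.body().html() ) );

</cfscript>
```

With the sample above, the h2 element should end up with the identifier heading-identifiers-in-html, which matches the first assertion from the Pandoc documentation.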

Note: My Feature Flags site uses Flexmark to convert from Markdown to HTML in ColdFusion (during site bootstrapping and initialization), which is why the two algorithms need to be aligned. This way, I don't need to install Pandoc on my server, nor do I need to commit the generated HTML to source control.


Short link: https://bennadel.com/4537

Reader Comments

Ben Nadel Dec 6, 2023 at 1:12 PM


One thing that is rather annoying about this algorithm (from a purely superficial standpoint) is that it leaves . in the identifier. For example, I have a heading:

Build vs. Buy

... and Pandoc generates the identifier as:

build-vs.-buy

Notice the embedded . in the ID. Frankly, this just looks sloppy to me. I know this is crazy, but I'm actually considering removing the . from my heading in order to "fix" the generated ID. But, this is silly since the ID won't even be visible in the book (only on the website).

Uggg, we shallow, superficial humans.

Ben Nadel Dec 6, 2023 at 1:17 PM


Re: "Uggg, we shallow, superficial humans." -- the most recent Hidden Brain podcast episode talked about this innate human desire to seek out beauty:

https://hiddenbrain.org/podcast/the-mystery-of-beauty/

They believe it might actually be an evolutionary trait. That we are naturally drawn to beautiful things because they are a sign of health and longevity. They also talk about how some scientific breakthroughs have been made by assuming that the natural world is beautiful; and, that it would have underlying formulas and maths that were also beautiful.

All to say, the fact that I am quite aggravated by a . in my ID-token isn't so much silly as it is a sign that I am deeply human. And, attracted to beautiful things.
