Ben Nadel at NCDevCon 2011 (Raleigh, NC) with: Rakshith Naresh

Parsing Lists Using A RegEx Delimiter In Lucee CFML 5.3.8.201

By Ben Nadel

In honor of yesterday's Regular Expression Day 2022 celebration, I wanted to play around with parsing lists in ColdFusion using a RegEx (Regular Expression) delimiter. Lists are the unsung heroes of the CFML language; and, are usually delimited by a single character (or set of single characters). But, the beauty of a list is that it's just a String; and, you can make a list out of anything using any delimiter. And, sometimes, I'd like that delimiter to be something more flexible, more dynamic. To start exploring this concept, I'm going to create jreListFirst() and jreListRest() functions.

ASIDE: The jre method name prefix is an homage to my JRegEx ColdFusion component project, which provides powerful ways to leverage Java's RegEx engine. Eventually, I might move these list-functions into that component.

To be honest, I wanted to publish this code yesterday in the RegEx Day post. However, I wasn't able to get it done in time. In fact, after failing to finish it yesterday, I started over from scratch this morning with a completely different approach.

In yesterday's approach, I was trying to be lazy about the list parsing. Meaning, I didn't want to parse the entire list; I only wanted to parse whatever prefix was relevant for a pattern-based listFirst() and listRest() implementation.

It was sort of going OK until I tried to add the includeEmptyFields argument. Trying to be lazy and take empty items into account some of the time was getting too complicated and hard to reason about internally. As such, my new approach this morning is centered around parsing the entire list into "segments". And then, consuming those segments.

This approach ultimately means that a lot of the parsing work is "wasted" in the average request. But, since most lists in ColdFusion are very small, the waste should be relatively innocuous.

The core of my approach is a function called jreSegmentList(). This function takes a list and a delimiter pattern and collects both the items and the matched delimiters into a single array. It always includes empty items and defers to the calling context to filter those empty items out as needed:

<cfscript>

	/**
	* I split the given list into segments using the given delimiter RegEx pattern. Empty
	* items are collected between neighboring delimiters.
	*/
	public array function jreSegmentList(
		required string list,
		required string delimiterPattern
		) {

		if ( ! delimiterPattern.len() ) {

			throw(
				type = "JRE.EmptyPattern",
				message = "JRE pattern for list delimiter cannot be empty."
			);

		}

		var matcher = createObject( "java", "java.util.regex.Pattern" )
			.compile( delimiterPattern )
			.matcher( list )
		;
		var segments = [];

		// NOTE: Technically, CFML Strings are Java Strings; however, since we're going to
		// dip down into the Java layer methods, it's comforting to explicitly cast the
		// value to the native Java type, if nothing else, to provide some documentation as
		// to where those methods are coming from.
		var input = javaCast( "string", list );
		var endOfLastMatch = 0;

		// Search for list delimiters.
		while ( matcher.find() ) {

			// If the start of the current match lines-up with the end of the previous
			// match, it means that there was no non-delimiter value between the two
			// matched delimiters. As such, we need to insert an empty-item.
			if ( matcher.start() == endOfLastMatch ) {

				segments.append({
					isItem: true,
					isDelimiter: false,
					text: ""
				});

			// If the start of the current match does NOT line-up with the end of the
			// previous match, it means we have a non-delimiter value to collect from
			// between the two matched delimiters.
			} else {

				segments.append({
					isItem: true,
					isDelimiter: false,
					text: input.substring( endOfLastMatch, matcher.start() )
				});

			}

			segments.append({
				isItem: false,
				isDelimiter: true,
				text: matcher.group()
			});

			endOfLastMatch = matcher.end();

		}

		// If the last match ended before the end of the input, it means that we have some
		// non-delimiter text at the end of the list to collect.
		if ( endOfLastMatch < input.length() ) {

			segments.append({
				isItem: true,
				isDelimiter: false,
				text: input.substring( endOfLastMatch )
			});

		// If the last delimiter match also hit the end of the input, it means that we
		// have to append an empty item after the last delimiter.
		} else {

			segments.append({
				isItem: true,
				isDelimiter: false,
				text: ""
			});

		}

		return( segments );

	}

</cfscript>

As you can see, I'm using Java's Pattern Matcher - the core technology in my JRegEx ColdFusion component - to iterate over the delimiters in the list. And, as I find delimiters, I add both the delimiters and the in-between items to the segments array. This algorithm also makes sure to prepend and append empty items to the list if there is a leading or trailing delimiter, respectively.
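
Because the delimiter is a pattern rather than a fixed character, a single call can split on mixed delimiters and absorb the whitespace around them. Here is a minimal sketch, assuming the jreSegmentList() UDF above is already defined on the page. The sample list and pattern are made up for illustration, and the output in the comments comes from tracing the function by hand rather than from a captured run:

<cfscript>

	// Hypothetical input: items separated by either a comma or a semi-colon, with
	// optional whitespace on either side of the delimiter.
	segments = jreSegmentList( "a, b;c", "\s*[,;]\s*" );

	for ( segment in segments ) {

		writeOutput(
			( segment.isItem ? "item" : "delimiter" ) &
			": [" & segment.text & "]<br />"
		);

	}

	// Rendered output (traced by hand):
	// item: [a]
	// delimiter: [, ]
	// item: [b]
	// delimiter: [;]
	// item: [c]

</cfscript>

Note that each delimiter segment holds the exact text that the pattern matched, which is what allows a calling function to stitch part of the list back together later without losing the original delimiters.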

If we run the jreSegmentList() function on a "single item" that has a leading and trailing delimiter:

dump( jreSegmentList( ",a,", "," ) );

... we get the following output:

[Image: A list parsed into segments with empty items automatically added between delimiters in ColdFusion.]

As you can see, jreSegmentList() collected an empty item before the leading delimiter and an empty item after the trailing delimiter. And, each segment has been marked as either a delimiter or an item.
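
Spelled out as a literal, the five segments for that call should look like the expected array below. This is a sketch worked out by tracing the function by hand, not captured output; the expected and actual variable names are only for illustration, and the jreSegmentList() UDF is assumed to be defined on the page:

<cfscript>

	// Hand-traced segments for the list ",a," using "," as the delimiter pattern.
	expected = [
		{ isItem: true,  isDelimiter: false, text: "" },
		{ isItem: false, isDelimiter: true,  text: "," },
		{ isItem: true,  isDelimiter: false, text: "a" },
		{ isItem: false, isDelimiter: true,  text: "," },
		{ isItem: true,  isDelimiter: false, text: "" }
	];

	actual = jreSegmentList( ",a,", "," );

	// Compare the segment count first, then each segment's type and text.
	dump( actual.len() == expected.len() );

	for ( i = 1 ; i <= expected.len() ; i++ ) {

		dump(
			( actual[ i ].isItem == expected[ i ].isItem ) &&
			( actual[ i ].text == expected[ i ].text )
		);

	}

</cfscript>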

With this ColdFusion User Defined Function (UDF) in place, building a pattern-based listFirst() and listRest() becomes significantly simpler. Here's the jreListFirst() implementation:

<cfscript>

	/**
	* I return the first item in the list using the given RegEx delimiter pattern.
	*/
	public string function jreListFirst(
		required string list,
		required string delimiterPattern,
		boolean includeEmptyFields = false
		) {

		for ( var segment in jreSegmentList( list, delimiterPattern ) ) {

			if ( segment.isItem && ( segment.text.len() || includeEmptyFields ) ) {

				return( segment.text );

			}

		}

		return( "" );

	}

</cfscript>

As you can see, the inclusion or exclusion of empty items is left up to the calling context. The jreSegmentList() function always includes them; then, whether or not they are meaningful is determined by the includeEmptyFields argument in the calling function.

Now, finding the first item in the list is as easy as looping over the parsed segments until we find one that has the .isItem property enabled and is either non-empty or permitted to be empty by the includeEmptyFields argument.
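
As a quick illustration of how includeEmptyFields changes the result, here is a small sketch that assumes both jreSegmentList() and jreListFirst() are defined on the page. The inputs and the pattern are hypothetical, and the values in the comments were traced by hand:

<cfscript>

	// The pattern treats "::" plus any surrounding whitespace as a single delimiter.
	delimiterPattern = "\s*::\s*";

	// No leading delimiter: the first item is simply the text before the first match.
	dump( jreListFirst( "foo :: bar :: baz", delimiterPattern ) ); // "foo"

	// Leading delimiter, empty fields excluded (default): the empty item that sits
	// before the leading delimiter is skipped.
	dump( jreListFirst( " :: bar :: baz", delimiterPattern ) ); // "bar"

	// Leading delimiter, empty fields included: the leading empty item wins.
	dump( jreListFirst( " :: bar :: baz", delimiterPattern, true ) ); // ""

</cfscript>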

The jreListRest() function is slightly more complicated. But, the mechanics are the same. Only, instead of looping until we find the first item, we're going to loop over the entire set of segments, but filter-out every segment before the second item:

<cfscript>

	/**
	* I return the list without the first item (and its surrounding delimiters) using the
	* given RegEx delimiter pattern.
	*/
	public string function jreListRest(
		required string list,
		required string delimiterPattern,
		boolean includeEmptyFields = false
		) {

		var itemCount = 0;
		var restOfList = jreSegmentList( list, delimiterPattern )
			// Once we have the list segments, we need to start looking for items. We want
			// to drop the first item and its surrounding delimiters. As such, we only
			// want to start collecting segments when we find our SECOND item.
			.filter(
				( segment ) => {

					if ( segment.isItem && ( segment.text.len() || includeEmptyFields ) ) {

						itemCount++;

					}

					return( itemCount >= 2 );

				}
			)
			// Now that we have our collected segments, we need to collapse them back down
			// into a single string value. For that, we'll extract the individual segment
			// strings and then join them together.
			.map(
				( segment ) => {

					return( segment.text );

				}
			)
			.toList( "" )
		;

		return( restOfList );

	}

</cfscript>

Again, it's unfortunate that I am parsing the entire list before I attempt to either extract or exclude the first item; but, as you can see from the logic in these down-stream functions, it makes life much easier!
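
One consequence of keeping the matched delimiters as segments is that jreListRest() returns the remainder of the list with its original delimiter text intact. Here is a small sketch, assuming jreSegmentList() and jreListRest() are defined on the page. The inputs are hypothetical and the values in the comments were traced by hand:

<cfscript>

	// The first item ("foo") and the delimiter that follows it are dropped; the
	// delimiter between the remaining items is preserved exactly as it was matched.
	dump( jreListRest( "foo :: bar :: baz", "\s*::\s*" ) ); // "bar :: baz"

	// With mixed delimiters, the ", " after the first item is dropped, but the ";"
	// between the remaining items is kept as-is.
	dump( jreListRest( "a, b;c", "\s*[,;]\s*" ) ); // "b;c"

</cfscript>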

To help make sure I was getting the logic right, I had a few "unit tests" for the different functions (I'm excluding the UDFs here for the sake of brevity):

<cfscript>

	// Testing some list-first patterns.
	dump( jreListFirst( "", "-" )                    == "" );
	dump( jreListFirst( "-------", "-" )             == "" );
	dump( jreListFirst( "-------", "-", true )       == "" );
	dump( jreListFirst( "--abc--", "-" )             == "abc" );
	dump( jreListFirst( "--abc--", "-+" )            == "abc" );
	dump( jreListFirst( "--abc--", "-", true )       == "" );
	dump( jreListFirst( "--abc--", "-+", true )      == "" );
	dump( jreListFirst( "abc----", "-" )             == "abc" );
	dump( jreListFirst( "abc----", "-", true )       == "abc" );

	// Testing some list-rest patterns.
	dump( jreListRest( "", "-" )                     == "" );
	dump( jreListRest( "-------", "-" )              == "" );
	dump( jreListRest( "-------", "-", true )        == "------" );
	dump( jreListRest( "-------", "-+", true )       == "" );
	dump( jreListRest( "--abc--", "-" )              == "" );
	dump( jreListRest( "--abc--", "-", true )        == "-abc--" );
	dump( jreListRest( "--abc--", "-+", true )       == "abc--" );
	dump( jreListRest( "abc----", "-" )              == "" );
	dump( jreListRest( "abc----", "-", true )        == "---" );
	dump( jreListRest( "----abc", "-" )              == "" );
	dump( jreListRest( "----abc", "-", true )        == "---abc" );
	dump( jreListRest( "--abc--def--", "-" )         == "def--" );
	dump( jreListRest( "--abc--def--", "-", true )   == "-abc--def--" );
	dump( jreListRest( "--abc--def--", "-+", true )  == "abc--def--" );
	dump( jreListRest( "abc", "-", true )            == "" );

</cfscript>

All of these dump() calls output true. All expected outcomes are achieved!

ASIDE: See Adam Tuttle and Adam Cameron, I do test my code occasionally ;)

I love Lists. I love Regular Expression pattern matching. So, it only makes sense that I would want to combine the two in ColdFusion. And now that I've created two list functions using the jreSegmentList() technique, I can easily see a path forward for implementing several of the other list methods in ColdFusion.
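
As one possible example of that path, a pattern-based version of listToArray() might be sketched on top of the same segments. To be clear, jreListToArray() is a hypothetical name and is not part of the post's code or of the JRegEx component; the sketch assumes jreSegmentList() is defined on the page, and the values in the usage comments were traced by hand:

<cfscript>

	/**
	* HYPOTHETICAL SKETCH: I return the items in the list as an array using the given
	* RegEx delimiter pattern.
	*/
	public array function jreListToArray(
		required string list,
		required string delimiterPattern,
		boolean includeEmptyFields = false
		) {

		var items = jreSegmentList( list, delimiterPattern )
			// Keep only the item segments, dropping empty ones unless the caller
			// asked for them.
			.filter(
				( segment ) => {

					return( segment.isItem && ( ( segment.text.len() > 0 ) || includeEmptyFields ) );

				}
			)
			// Reduce each remaining segment down to its text value.
			.map(
				( segment ) => {

					return( segment.text );

				}
			)
		;

		return( items );

	}

	dump( jreListToArray( "a, b;c", "\s*[,;]\s*" ) ); // [ "a", "b", "c" ]
	dump( jreListToArray( ",a,", ",", true ) );       // [ "", "a", "" ]

</cfscript>

This follows the same shape as jreListFirst() and jreListRest(): the segment function does the parsing, and the calling function only decides which segments matter.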
