Understanding RegExp Capture Groups When Using .split() In JavaScript
Yesterday, I was trying to take a plain-text value and split it into paragraphs using a regular expression in JavaScript. At first, it seemed to be working. But, on closer inspection of the rendered output, I noticed that I was inserting an empty paragraph in between each populated paragraph. After 30 minutes of debugging and looking through the MDN documentation, I realized that I had an incomplete mental model for how String.prototype.split() works when using a regular expression delimiter.
In my first attempt at splitting the input text into paragraphs, I was using this RegExp pattern:
(\r\n?|\n)
This is a pretty standard regular expression pattern that attempts to account for both Windows-based and *nix-based line delimiters (it matches \r\n, a lone \r, and a lone \n). So, at first, I was quite befuddled when my .split() call wasn't "just working".
As a random debugging effort, I tried removing the parentheses:
\r\n?|\n
And, suddenly, my empty paragraphs were gone! It didn't make any sense. Until I saw this line in the MDN docs:
For each match, the substring between the last matched string's end and the current matched string's beginning is first appended to the result array. Then, the capturing groups' values are appended one-by-one.
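The difference between the two patterns can be reproduced in isolation. This is a minimal sketch, assuming a short two-paragraph sample (the `text` value is made up for illustration and is not the original input):

```javascript
// Hypothetical sample input: two paragraphs separated by a single newline.
var text = "First paragraph\nSecond paragraph";

// With the capturing group, the matched newline is included in the result.
var withGroup = text.split( /(\r\n?|\n)/ );

// Without the capturing group, only the paragraphs come back.
var withoutGroup = text.split( /\r\n?|\n/ );

console.log( withGroup );    // [ 'First paragraph', '\n', 'Second paragraph' ]
console.log( withoutGroup ); // [ 'First paragraph', 'Second paragraph' ]
```

Rendered as paragraphs, that lone '\n' element is what shows up as the "empty" paragraph.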
It turns out that captured groups are included in the .split() result as individual array elements. Let's see this in action. In the following test, I'm taking a single input and I'm splitting it using the "same delimiter" with an increasing number of captured groups.
var input = "a,b,c:1,2,3";
// No captured group in pattern. Results contain ONLY the separated segments.
console.log( input.split( /[,:]/ ) );
// All delimiters in a single captured group.
console.log( input.split( /([,:])/ ) );
// Each delimiter in its own captured group.
console.log( input.split( /(,)|(:)/ ) );
// All delimiters AND each delimiter in its own captured group.
console.log( input.split( /((,)|(:))/ ) );
In all cases, we're splitting the input string on either , or :. But, in each subsequent .split() call, we're organizing the delimiter pattern with different capture groups. Here's what we get:
// Pattern: /[,:]/
[ 'a', 'b', 'c', '1', '2', '3' ]
Without any capture groups, all we get are the split segments.
// Pattern: /([,:])/
[
'a',
',', // Delimiter.
'b',
',', // Delimiter.
'c',
':', // Delimiter.
'1',
',', // Delimiter.
'2',
',', // Delimiter.
'3'
]
With a single capture group, we get the captured delimiter appended after each split.
// Pattern: /(,)|(:)/
[
'a',
',', // First capture group.
undefined, // Second capture group.
'b',
',', // First capture group.
undefined, // Second capture group.
'c',
undefined, // First capture group.
':', // Second capture group.
'1',
',', // First capture group.
undefined, // Second capture group.
'2',
',', // First capture group.
undefined, // Second capture group.
'3'
]
With two capture groups, both groups are appended after each split segment; a group that didn't participate in the match is included as undefined.
// Pattern: /((,)|(:))/
[
'a',
',', // FULL capture group.
',', // First delimiter capture.
undefined, // Second delimiter capture.
'b',
',', // FULL capture group.
',', // First delimiter capture.
undefined, // Second delimiter capture.
'c',
':', // FULL capture group.
undefined, // First delimiter capture.
':', // Second delimiter capture.
'1',
',', // FULL capture group.
',', // First delimiter capture.
undefined, // Second delimiter capture.
'2',
',', // FULL capture group.
',', // First delimiter capture.
undefined, // Second delimiter capture.
'3'
]
As you can see, the more capture groups we add to our regular expression pattern, the longer our .split() results get.
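The growth appears to be predictable: each match contributes one extra element per capture group. As a rough sketch (the `expectedLength()` helper is hypothetical, introduced only for this illustration, and its constants are specific to this input), the result length is the number of segments plus the number of matches multiplied by the number of capture groups. A non-capturing group, `(?:...)`, is included for comparison since it groups without adding anything to the result:

```javascript
var input = "a,b,c:1,2,3";

// Hypothetical helper: segments + ( matches * capture groups ).
// This particular input has 6 segments separated by 5 delimiters.
function expectedLength( groupCount ) {
	return 6 + ( 5 * groupCount );
}

var lengths = [
	input.split( /[,:]/ ).length,      // No capture groups.
	input.split( /([,:])/ ).length,    // One capture group.
	input.split( /(,)|(:)/ ).length,   // Two capture groups.
	input.split( /((,)|(:))/ ).length, // Three capture groups.
	input.split( /(?:,|:)/ ).length    // Non-capturing group only.
];

console.log( lengths ); // [ 6, 11, 16, 21, 6 ]
console.log( [ 0, 1, 2, 3 ].map( expectedLength ) ); // [ 6, 11, 16, 21 ]
```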
Now that we know this, we can go back to the paragraph-splitting behavior and create an algorithm that filters out the delimiters from the result:
console.log(
	breakIntoParagraphs( "Lorem Ipsum\n\n\n\nDollar sit\n\nBacon yum." )
);
function breakIntoParagraphs( input ) {
	return input
		// This split will include the line delimiters in the result.
		.split( /(\r\n?|\n)+/ )
		// Filter-out the line delimiters (which are nothing but white space).
		.filter(
			( segment ) => {
				return segment.trim();
			}
		)
	;
}
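An alternative worth considering, sketched here under the assumption that the parentheses are only needed for grouping: a non-capturing group keeps the line delimiters out of the result in the first place. The function name `breakIntoParagraphsNonCapturing()` is hypothetical, and the filter is kept because leading or trailing line breaks would otherwise still produce empty strings:

```javascript
// Hypothetical variant of breakIntoParagraphs() using a non-capturing group.
function breakIntoParagraphsNonCapturing( input ) {
	return input
		// (?:...) groups the alternation without capturing it, so the line
		// delimiters are not appended to the result.
		.split( /(?:\r\n?|\n)+/ )
		// Leading / trailing line breaks still yield empty strings, and lines
		// containing only white space may remain, so filter those out.
		.filter( ( segment ) => segment.trim() !== "" );
}

console.log(
	breakIntoParagraphsNonCapturing( "Lorem Ipsum\n\n\n\nDollar sit\n\nBacon yum.\n" )
);
// [ 'Lorem Ipsum', 'Dollar sit', 'Bacon yum.' ]
```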
I'm sure there are good use-cases for this behavior. But, unless you know how it works ahead of time, this behavior can very easily lead to bugs. Hopefully I will remember this caveat going forward.
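One such use-case might be tokenizing a string while keeping the delimiters. The following is only an illustrative sketch; the `expression` value and the operator pattern are assumptions chosen for the demo:

```javascript
// Hypothetical input: a tiny arithmetic expression.
var expression = "10+20-5";

// The single capture group keeps each operator as its own token.
var tokens = expression.split( /([+-])/ );

console.log( tokens ); // [ '10', '+', '20', '-', '5' ]

// Because the group captures the entire match, nothing is lost and the
// tokens can be joined back into the original string.
console.log( tokens.join( "" ) === expression ); // true
```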
Reader Comments
Ha ha—it turns out that I actually wrote about this back in 2017 (seven years ago!):
You Can Include Delimiters In The Result Of JavaScript's String .split() Method When Using Capturing Groups
So, at the end of this current post when I say, "Hopefully I will remember this caveat going forward", it doesn't bode well. Apparently I wasn't able to remember it from last time.
I considered removing / unpublishing this current post; but, I figure it still might be helpful to have it out there.