Ask Ben: Breaking An SMS Text Message Up Into Multiple Parts
This isn't necessarily an "Ask Ben" question; Michael Appenzellar had brought up the concept of breaking up an SMS text message into multiple parts that were, at most, 120 characters each. He was having a bit of trouble breaking it up, so I thought I would throw together a quick little demo. To start with, let's create the text message that you might want to send to someone around this time of the year:
<!---
Store the message that we would like to split up
into MAX:120 character segments.
--->
<cfsavecontent variable="strMessage">
Deborah, thank you so much for coming over for Christmas
celebrations. I had quite a fabulous time. I hope that
the present I got for you was not offensive; I just fancy
you rather attractive and I could only imagine that that
kind of outfit would have looked insanely delicious on you.
Happy Holidays.
</cfsavecontent>
<!---
Clean the message - trim it and replace out special
characters (line breaks, tabs, carriage returns) with
a space.
--->
<cfset strMessage = REReplace(
Trim( strMessage ),
"[\t\r\n\s]+",
" ",
"all"
) />
Don't pay attention to that REReplace() - that just takes the string stored using ColdFusion's CFSaveContent tag and strips out the extra tabbing and line breaks. I just like using CFSaveContent for formatting / display reasons.
Ok, now that we have our message, we want to break it up into 120 max-character SMS text messages. Initially, you might just try to use ColdFusion's Mid() function to grab every 120 character substring of the message:
<!--- Break the message into 120 character strings. --->
<cfloop
index="intOffset"
from="1"
to="#Len( strMessage )#"
step="120">
<!--- Output this max:120 character segment. --->
<p>
#Mid( strMessage, intOffset, 120 )#
</p>
</cfloop>
On paper, this looks good, but when you run it, you see that it's not quite ideal. We end up splitting the message up into these three segments:
Deborah, thank you so much for coming over for Christmas celebrations. I had quite a fabulous time. I hope that the pres
ent I got for you was not offensive; I just fancy you rather attractive and I could only imagine that that kind of outfi
t would have looked insanely delicious on you. Happy Holidays.
As you can see, the word "present" in the first line and the word "outfit" in the second line are split between two SMS text messages. The problem here is that Mid() has no context; it has no understanding of the problem in which it is being used. As such, it doesn't care about splitting words.
Now, you could take that and start adding a bunch of logic to backtrack characters until you hit a space and then adjust your start offset and stuff. That can all get sticky. The easier approach is to leverage the robust rules that can be applied using Regular Expressions. We can think of our SMS message segments as consisting of a pattern and that pattern is that the captured match must be at most 120 characters and must end on an appropriate character (meaning, it cannot end in the middle of a word).
I am going to arbitrarily say that a word is considered "split in half" if the next matched character is NOT a space, dash, colon, or "end of string" character. Anything that does not follow this rule must remain grouped together. To apply this kind of pattern rule, we are going to use a positive look ahead:
.{1,120}(?=([\s\-:]|$))
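To see how the look ahead behaves, here is a minimal sketch that runs the same pattern with a limit of 10 instead of 120 against a short, made-up string. The variable names strSample, arrParts, and strPart are hypothetical and exist only for this illustration; the brackets in the output are there to make leading spaces visible.

```cfml
<!---
	Hypothetical sample string; the limit is dropped to 10 so
	that the matches are easy to follow by eye.
--->
<cfset strSample = "one two three four" />

<!--- Same pattern as above, only with {1,10} instead of {1,120}. --->
<cfset arrParts = REMatch(
	".{1,10}(?=([\s\-:]|$))",
	strSample
	) />

<!--- Wrap each match in brackets so leading spaces are visible. --->
<cfoutput>
	<cfloop
		index="strPart"
		array="#arrParts#">
		<p>
			[#strPart#]
		</p>
	</cfloop>
</cfoutput>
```

The dot grabs up to 10 characters greedily and then gives characters back until the next character is a space, dash, colon, or the end of the string. For this sample that should produce three matches: "one two", " three", and " four". The look ahead does not consume the space it finds, so that space becomes the first character of the following match, which is why trimming each segment before sending it is a good idea.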
Now, using that pattern in conjunction with ColdFusion 8's new REMatch() function makes this almost too easy:
<!---
Get a 120 limit character pattern using regular
expression. This is a pattern that can match up to
120 characters and MUST be followed by an acceptable
word boundary.
--->
<cfset arrSegments = REMatch(
".{1,120}(?=([\s\-:]|$))",
strMessage
) />
<!--- Output the segments returned in the array. --->
<cfloop
index="strSegment"
array="#arrSegments#">
<p>
#Trim( strSegment )#
</p>
</cfloop>
Running this code, we get the following, more appropriate output:
Deborah, thank you so much for coming over for Christmas celebrations. I had quite a fabulous time. I hope that the
present I got for you was not offensive; I just fancy you rather attractive and I could only imagine that that kind of
outfit would have looked insanely delicious on you. Happy Holidays.
Notice that this time, both the words "present" and "outfit" remain intact, but are moved completely to the next SMS text message. Works like a charm. And, since regular expression pattern matching always picks up where it left off, you never have to worry about word wrapping conflicting with the next segment match.
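If you need this in more than one place, one option is to wrap the pattern in a small user defined function so that the limit is not hard-coded. This is only a sketch: the function name SplitMessage and its argument names are hypothetical, and it assumes the message has already been cleaned of line breaks as shown earlier.

```cfml
<cffunction
	name="SplitMessage"
	returntype="array"
	output="false"
	hint="Sketch: splits a cleaned message into segments of at most MaxLength characters.">

	<cfargument name="Message" type="string" required="true" />
	<cfargument name="MaxLength" type="numeric" required="false" default="120" />

	<!--- Build the pattern with the requested limit and gather the matches. --->
	<cfset var arrSegments = REMatch(
		".{1,#ARGUMENTS.MaxLength#}(?=([\s\-:]|$))",
		ARGUMENTS.Message
		) />
	<cfset var intIndex = 0 />

	<!--- Trim off the boundary space that leads each later match. --->
	<cfloop
		index="intIndex"
		from="1"
		to="#ArrayLen( arrSegments )#">
		<cfset arrSegments[ intIndex ] = Trim( arrSegments[ intIndex ] ) />
	</cfloop>

	<cfreturn arrSegments />
</cffunction>

<!--- Example usage with the message from this post. --->
<cfset arrSegments = SplitMessage( strMessage, 120 ) />
```

One caveat with this approach: if the text contains a run longer than the limit with no space, dash, or colon in it (a very long URL, for example), the pattern cannot match at the start of that run and the matcher will skip ahead, so some characters may be left out of the result. If your messages might contain tokens like that, you would want to check for it separately.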
I hope that helps in some way.
Reader Comments
Nice! I'd been looking for something like this a while back. I'll keep this code handy :)
@Gareth,
No problem.
Whenever I see your code, and particularly RegExp, I get amazed, and feel you are writing magic.
Really I don't know how this:
.{1,120}(?=([\s\-:]|$))
will be interpreted to find the "end of string" character within limit of 120 characters !!
especially this face ... I mean part:
\-:
joking ... :)
But seriously I think I will never come with this solution to fix the word splitting bug. As I know my self, immediately I will use cfloop and basic cfif to check and do it. Any time I start learning RegExp I feel it difficult and get lazy to continue.
Any Way thanks a lot for your very interesting blog. I hope you always keep improved, and I wish if I can keep up your posts.
:)
@Ameen,
Thanks man. If you ever need help with any regular expression stuff, just let me know.
This regex IS awesome. Off to http://www.regular-expressions.info/lookaround.html to learn more...
Apparently -- Ben, please correct me if I'm wrong -- Ben is using a "positive lookahead" to match at least one and at most 120 consecutive characters followed by an "acceptable word boundary". An "acceptable word boundary" is a space (\s) or a dash (-) or a colon (:) or end of string ($). REMatch then returns an array of all matched occurrences. The - (hyphen) has to be escaped because it's a special character. So we have the smiley face \-:.
regular-expressions.info explains this on a very simple example: q(?=u) matches a q that is followed by a u, without making the u part of the match. Therefore, in Ben's regex, the "acceptable word boundary" is not returned, and we get at most 120 characters in each match. The "acceptable word boundary" that delimited the previous match is included in the next because it falls under the dot (which means "anything") metacharacter.
Another note: dot will not match a newline, but as Ben has already removed all the newlines, it is a non issue. However, the example will not work without removing the newlines.
Regular expressions make my brain hurt...
@Nikola,
Correct, I am using a positive look-ahead. They are very powerful. The site you mention, regular expression info is very very good. I refer to it all the time.