Converting Numbers To A Character-Set Based Radix Using ColdFusion
Earlier this morning on Twitter, someone asked about why "tiny url" sites used both numbers and letters in their short URLs. I responded that using both numbers and letters gives a much greater set of possible values as there are more possibilities for each digit. After this micro conversation, I thought it would be fun to try and create a tiny URL service as I don't really know what the best approach would be.
When I sat down and started to think about it, I realized that it would be hugely beneficial to be able to create a tiny URL based on a numeric index in a database. Furthermore, when it comes to looking up a tiny URL, it would be awesome to be able to convert that tiny URL back into its original numeric value. As I was contemplating this, I realized that what I might be doing is converting a base-10 number to a number of a different base (defined by our alpha/numeric character set).
ColdFusion has some great utility methods for base conversions - FormatBaseN() and InputBaseN(). But, those methods only go up to a 36-radix (26 characters + 10 numbers = 36 characters). The reason for this is that its base conversions are case-insensitive, such that "A" is the same as "a". Unfortunately, as I wanted to use both upper and lower case characters as different characters for a greater variety, I couldn't make use of ColdFusion's built-in conversion methods.
So then, I started to think about other characters that might be used. In the past, I have built tiny URLs that have "-" characters and "_" characters. How to I use that in a base conversion? That's when it dawned on me! The radix that I am using is not defined by a standard set of data; rather, my radix is defined by an arbitrary character set. And so, I set about creating two ColdFusion user defined methods that would convert a value based on a passed-in character array in which the index of each character would represent the decimal value of that particular character.
To align more naturally with ColdFusion's built-in methods, I named them similarly:
- FormatBaseNData( decimalValue, characterArray )
- InputBaseNData( formattedValue, characterArray )
Here is the code for these ColdFusion user defined functions:
<cffunction
name="FormatBaseNData"
access="public"
returntype="string"
output="false"
hint="I take a demical value and then convert it to the number set defined by the passed in array.">
<!--- Define arguments. --->
<cfargument
name="Value"
type="numeric"
required="true"
hint="I am the numberic value that is being converted to a new radix."
/>
<cfargument
name="CharacterSet"
type="array"
required="true"
hint="I am the character set used in the conversion. Each character represents the number at it's array index."
/>
<!--- Define the local scope. --->
<cfset var LOCAL = {} />
<!--- Create the base string to be returned. --->
<cfset LOCAL.EncodedValue = "" />
<!---
Get the length of our array. This will be our radix for
the conversion. NOTE: Because ColdFusion arrays start at
1, not zero, we will have to some manual offsetting when
we perform the conversion.
--->
<cfset LOCAL.Radix = ArrayLen( ARGUMENTS.CharacterSet ) />
<!---
Get a local copy of our value as we will be updating it
as we divide into it.
--->
<cfset LOCAL.Value = ARGUMENTS.Value />
<!---
When converting to a new radix, we need to keep dividing
the value passed in until we hit zero (which will never
have a remainder). However, because we always want to
perform at least ONE division, we will break from within
the loop if it hits zero rather than check for that
loop conditional.
--->
<cfloop condition="true">
<!--- Get the division result. --->
<cfset LOCAL.Result = Fix( LOCAL.Value / LOCAL.Radix ) />
<!--- Get the remainder of radix division. --->
<cfset LOCAL.Remainder = (LOCAL.Value MOD LOCAL.Radix) />
<!---
Take the remainder and prepend the Radix-converted
string to the encoded value. Remember, since we are
using arrays that start at 1, we need to add one to
this value.
--->
<cfset LOCAL.EncodedValue = (
ARGUMENTS.CharacterSet[ LOCAL.Remainder + 1 ] &
LOCAL.EncodedValue
) />
<!---
Now that we have gotten the current, store the result
value back into the value.
--->
<cfset LOCAL.Value = LOCAL.Result />
<!---
Check to see if we have any more value to divide into.
Once we hit zero, we are out of possible remainders.
--->
<cfif NOT LOCAL.Value>
<cfbreak />
</cfif>
</cfloop>
<!--- Return the encoded value. --->
<cfreturn LOCAL.EncodedValue />
</cffunction>
<cffunction
name="InputBaseNData"
access="public"
returntype="string"
output="false"
hint="I take an encoded value and convert it back to demical based on the passed in character array.">
<!--- Define arguments. --->
<cfargument
name="Value"
type="string"
required="true"
hint="I am the encode value that is being converted back to a base 10 number."
/>
<cfargument
name="CharacterSet"
type="array"
required="true"
hint="I am the character set used in the conversion. Each character represents the number at it's array index."
/>
<!--- Define the local scope. --->
<cfset var LOCAL = {} />
<!--- Create the base number to be returned. --->
<cfset LOCAL.DecodedValue = 0 />
<!---
Get the length of our array. This will be our radix for
the conversion. NOTE: Because ColdFusion arrays start at
1, not zero, we will have to some manual offsetting when
we perform the conversion.
--->
<cfset LOCAL.Radix = ArrayLen( ARGUMENTS.CharacterSet ) />
<!---
Convert our character set to a list so that we can easily
get the numeric value of our encoded digit.
--->
<cfset LOCAL.CharacterList = ArrayToList( ARGUMENTS.CharacterSet ) />
<!---
Reverse the string that was passed in. We are doing this
because the right-most value is actually the smallest
place and it will be easier for us to deal with in reverse.
--->
<cfset LOCAL.Value = Reverse( ARGUMENTS.Value ) />
<!---
Now, break the value up into an array so that we can more
easily iterate over it.
--->
<cfset LOCAL.ValueArray = ListToArray(
REReplace(
LOCAL.Value,
"(.)",
"\1,",
"all"
)
) />
<!---
Iterate over the array and convert each value to a power
of our character set defined radix.
--->
<cfloop
index="LOCAL.Index"
from="1"
to="#ArrayLen( LOCAL.ValueArray )#"
step="1">
<!---
Convert the current digit and add it to the going sum
of our conversion.
--->
<cfset LOCAL.DecodedValue += (
(ListFind( LOCAL.CharacterList, LOCAL.ValueArray[ LOCAL.Index ] ) - 1) *
(LOCAL.Radix ^ (LOCAL.Index - 1))
) />
</cfloop>
<!--- Return the decoded value. --->
<cfreturn LOCAL.DecodedValue />
</cffunction>
To test this out, I used both a well known character set (Hexadecimal) and an arbitrary character set. For each test, I converted to the new characters-set-based radix and then back to the decimal value.
<!--- Get the HEX character set. --->
<cfset arrCharacters = ListToArray(
"0 1 2 3 4 5 6 7 8 9 A B C D E F",
" "
) />
<!--- Test conversions. --->
255 => #FormatBaseNData( 255, arrCharacters )#<br />
FF => #InputBaseNData( "FF", arrCharacters )#<br />
<br />
<!---
Get our collection of available characters that can be used
when creating character-sensitive tiny URLs.
--->
<cfset arrCharacters = ListToArray(
"a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 0 1 2 3 4 5 6 7 8 9",
" "
) />
<!--- Test conversions. --->
255 => #FormatBaseNData( 255, arrCharacters )#<br />
eh => #InputBaseNData( "eh", arrCharacters )#<br />
When we run this code, we get the following output:
255 => FF
FF => 255255 => eh
eh => 255
As you can see, our control (HEX) and our 62 character set custom radix both encoded and decoded the value properly.
Now that I have these ColdFusion UDFs in place, I think building tiny URL functionality should be much easier.
Want to use code from this post? Check out the license.
Reader Comments
I use a similar function to make unique but verifiable form nonces, to be used in protecting against XSS/XSRF attacks:
<cffunction name="makeNonce" returntype="string">
<cfargument name="inputString" type="string" required="true">
<cfset var md5=hash(arguments.inputString & Request.AppKey & len(arguments.inputString), "MD5")>
<cfset var bigInt=createObject("java","java.math.BigInteger").init(md5,javaCast("int",16)).abs()>
<cfset var chars="bcdfghjklmnpqrstvwxzBCDFGHJKLMNPQRSTVWXZ012345789">
<cfset var charCount=createObject("java","java.math.BigInteger").valueOf(javaCast("long",len(chars)))>
<cfset var remainder="">
<cfset var ret="">
<cfloop condition="bigInt.bitCount() gt 0">
<cfset remainder=bigInt.mod(charCount)>
<cfset bigInt=bigInt.divide(charCount)>
<cfset ret=ret & mid(chars,remainder.intValue()+1,1)>
</cfloop>
<cfreturn ret>
</cffunction>
My character set is a bit reduced, however, as I also use it for nonces for account creation, so I've removed any characters that might be mistaken if they had to be typed out by hand.
@Rick,
I am not sure that I follow what you are doing exactly. However, your use of BigInteger has me curious. One thing that I did get nervous about was the idea of the database going too high to be able to actually handle match on the given value. Of course, I think this would require like over a billion records, but considering the type of service, I am not sure that is beyond reason.
When you create a BigInteger, will ColdFusion be able to handle math with it? Example:
Fix( BigInt / 16 )
... or will it convert it to a standard number when returning the Fix() value?
Unfortunately, CF operators and BigInts don't really get along. The automagic typecasting that CF does tends to do really weird things at unexpected times.
But, just to be obstinate, the BigInt class has methods that are reserved CF words, like .and(), which makes CF choke when you try to chain them such as a.div(b).and(c) . Your example would work, as long as you translated it to: b.divide(sixteen).
Two tricks, though: (1) I said "sixteen" instead of "16" because you can't trust the overloading, and the math functions always expect another BigInt, and (2) since it is an Integer class I don't think it has an equivalent for Fix(). I might be wrong, though.
I used BigInts here because the result of an MD5 hash is a 128-bit number. The InputBaseN function doesn't work for anything larger than 16 bits. So unless I want to do all of the hex conversion and bit math by hand, such as you did, the BigInt is really the only way to go. I'm lazy.
As for overflowing your INT columns in your database, that's definitely a problem if you haven't planned ahead and used 64-bit numbers.
@Rick,
Thanks for the insight. I am not sure I full understand, but I will try to implement the BigInteger stuff to make sure it doesn't barf on large values.
As for the database, my INT column is size 10, which should allow it to be 80 bits, right? Or am I not understanding how the length in a database column works with INT values?
@Rick,
According to the BigInteger, there is a method:
BigInteger.divideAndRemainder() :: result[]
That looks awesome - takes care of both actions at the same time.
@Rick,
I didn't have time to play around with the BigInteger stuff; but, I was able to put the FormatBaseNData() method into effect for creating tiny URLs:
www.bennadel.com/index.cfm?dax=blog:1541.view
I've modified your methods to work with bigInts so I can create shortcodes from really large PKs. Here is the section I changed (with annotated comments where changes were made):