Converting A Byte Array Into A Bit Stream Using ColdFusion And BigInteger
The other day, I was faced with a situation where I had to split a set of bytes up into smaller sets of bits (smaller than the 8 bits that go into a byte). I was definitely stumped. Not only have I never done this before but, my understanding of bits (especially signed-bits) is rather weak. After some trial an error, however, I was able to leverage Java's BigInteger class to help me turn a byte array into a bit stream that can read an arbitrary number of bits at a time.
So, again, before I show the code, let me once again caveat that my understanding of the low-level binary representation of numbers is not super strong. I can get it to work - but that doesn't mean that it will handle outlier cases in an appropriate manner.
That said, I created a ColdFusion component - BitStream.cfc - that can take a string or a byte array and present two methods:
- read( size )
- readBigIngeter( size )
Both of them do the same thing - read the given number of bits into a numeric result. But, the first one returns a primitive integer whereas the second one returns a Java BigInteger. The reason for the BigInteger version is that the number of bits being read may not fit into a normal 32-bit integer (or is that 31 bits, less the signed bit - that's where I really start to get confused?).
Internally, you will see that they both use readBitInteger(); but, the read() version will simply return the .intValue() of the BigInteger. In either case, the stream will return -1 (or the BigInteger representation of -1) when there are no more bits to read.
Here is the BitStream.cfc ColdFusion component:
<cfscript>
// NOTE: CFScript tags used for Gist color-coding. They can be removed.
component
output = false
hint = "I provide a bit-based stream wrapper to an underlying byte array."
{
/**
* I take either a binary value (byte array) or a string and provide read-access to
* the underlying bits. If a string is passed-in, it is assumed to be UTF-8 encoded.
*
* @output false
*/
public any function init( required any input ) {
// If the input is a byte array, we can use it as-is.
if ( isBinary( input ) ) {
var byteArray = input;
// If the input is a string, we can decode the bytes.
} else if ( isSimpleValue( input ) ) {
var byteArray = charsetDecode( input, "utf-8" );
// If input is not supported for conversion.
} else {
throw(
type = "InvalidInput",
message = "Input must be one of the following: ByteArray, String, Integer"
);
}
// We're going to use the BigInteger to access the individual bits.
source = createObject( "java", "java.math.BigInteger" ).init( byteArray );
// If the leading bit(s) are zeros, they are not included in the bit length of
// the BitInteger, which will mess up our read. But, since this is all coming
// from a byte array, we can determine the bit length using the number of bytes.
bitLength = ( arrayLen( byteArray ) * 8 );
// Since the bits in the BigInteger are zero-based, we need to start at n-1.
bitIndex = ( bitLength - 1 );
// Each read action will start with this result and then start shifting and
// settings bits (which will return a new BigInteger instance).
resultTemplate = createObject( "java", "java.math.BigInteger" )
.init( javaCast( "string", "0" ) )
;
// When there are no more bits to read in the underlying source, we are going
// to return (-1) to indicate the end of the stream.
nonResultTemplate = createObject( "java", "java.math.BigInteger" )
.init( javaCast( "string", "-1" ) )
;
}
/**
* I read the given number of bits from the stream and return them as the right-most
* bits of an integer. If there are not enough bits available for a full-read, the
* result will be right-padded with zeros. When there are no more bits to read, a -1
* is returned.
*
* @output false
*/
public numeric function read( required numeric size ) {
return( readBigInteger( size ).intValue() );
}
/**
* I read the given number of bits from the stream and return them as the right-most
* bits of a BigInteger. If there are not enough bits available for a full-read, the
* result will be right-padded with zeros. When there are no more bits to read, a -1
* is returned (in the form of a BigInteger).
*
* @output false
*/
public any function readBigInteger( required numeric size ) {
// If we've gone past the bounds of BigInteger, return -1. There are no more
// bits to read from the underlying source.
if ( bitIndex < 0 ) {
return( nonResultTemplate );
}
// Start out result as zero.
var result = resultTemplate;
// When read read from the stream, we'll have to see if we miss any bits, which
// will have to be right-padded with zeros.
var missingBits = size;
// Keep reading a bit at a time until we fulfill the size requirement or, until
// we run out of bits to read.
while ( missingBits && ( bitIndex >= 0 ) ) {
// Will add a zero as the right-most bit.
result = result.shiftLeft( javaCast( "int", 1 ) );
// If the source bit is set (ie, non-zero), then toggle the right-most result
// bit (the one we just left-shifted into the result).
if ( source.testBit( javaCast( "int", bitIndex ) ) ) {
result = result.setBit( javaCast( "int", 0 ) );
}
// Update our while-conditions.
missingBits--;
bitIndex--;
}
// If we ran out of bits, mid-read, then we have to right-pad with zeros.
while ( missingBits-- ) {
result = result.shiftLeft( javaCast( "int", 1 ) );
}
return( result );
}
}
</cfscript>
To test this, I will convert a simple string - "abc" - into a set of bits and then compare that set to the bits that come out of the stream:
<cfscript>
// Each of the characters in this string is represented by a byte, which is composed
// of 8 bits. We're going to wrap this in a BitStream and then read chunks of the
// bits out, 5 at a time.
input = "abc";
// Create the bit stream.
stream = new BitStream( input );
// To demonstrate that the bit stream works, let's output the bits of the input
// string. This way, we can see how our 5-bit reads line up with the bits of the
// entire stream.
for ( character in listToArray( input, "" ) ) {
bitValue = formatBaseN( asc( character ), 2 );
// NOTE: We are using numberFormat so we don't lose leading zeros.
writeOutput( numberFormat( bitValue, "00000000" ) );
}
writeOutput( "<br />" );
results = [];
// Read first 5 bits.
result = stream.read( 5 );
// When we run out of bits, the stream will return -1.
while ( result != -1 ) {
// Get the underlying bit-string in our read chunk.
bitValue = formatBaseN( result, 2 );
// NOTE: We are using numberFormat so we don't lose leading zeros.
arrayAppend( results, numberFormat( bitValue, "00000" ) );
// Read the next chunk of bits.
result = stream.read( 5 );
}
// Now that we've read all the chunks of 5 bits, output with and without a space-
// delimiter. This will help us see if the resultant bits line up with the input
// stream AND also help us see where the chunks were bounded.
writeOutput( arrayToList( results, "" ) );
writeOutput( "<br />" );
writeOutput( arrayToList( results, " " ) );
</cfscript>
In this code, I'm using numberFormat() to make sure that I don't accidentally lose leading zeros in the output. This way, it will be easy to see where things line up. And, when we run the above code, we get the following page output:
011000010110001001100011
0110000101100010011000110
01100 00101 10001 00110 00110
As you can see, the first line (input bits) and the second line (stream bits) match up, except for the extra zero on the stream. The extra zero is a right-padded read-result. In order to make sure that N bits can always be read (assuming some bits are available), we need to right-pad with zeros. And since 5 (our read size) doesn't evenly divide into 24 (our total number of input bits), we need to make up the difference with zeros.
The third line contains the same data as the second line, but with a space between each read result. This way, we can easily see how the input bits map onto the stream bits.
Right now, this was just a stepping stone to a bigger concept that deals with base-encoding. But, more on that to come.
Want to use code from this post? Check out the license.
Reader Comments
When you say "signed-bit" that's confusing me - generally I know the bit that signs a byte as a "sign bit" or "signing bit", and then the byte itself can be a "signed byte"
I'm not trying to be pedantic - but can you actually "sign" a bit - isn't it just 0 or 1, as opposed to -1 or 1?
@Pete,
You are correct - I was not speaking accurately. You cannot sign an individual bit - they are always 1 or 0. I was referring to the bit that indicates the sign of a set of bytes (ie, the left-most bit).
Here's a bit about signed integers (pun intended).
An N-bit unsigned integer can only store positive values from zero to 2^N-1. So an 8-bit unsigned integer can store values from 0 to 255 decimal. In most cases the right-most bit is the least significant and the left-most is most significant, just like decimal numbers, except that in binary we're dealing with powers of 2 instead of 10. In this case, the right-most bit represents 2^0, followed by 2^1, 2^2, etc. up to 2^7 for the left-most bit.
0000 0010 = 2
0001 0000 = 16
1000 0000 = 128
1010 1010 = 128 + 32 + 8 + 2 = 170
1111 1111 = 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = 255
For signed integers, the left-most bit (the most significant bit) is flipped to represent the negative of the amount it would represent in an unsigned integer. This "shifts" the range available to have zero in the middle of the range (well, almost). For an 8-bit signed integer, the range is -128 to 127. You calculate the represented value in the same way, but remember the MSB is negative.
0000 0010 = 2
0000 1111 = 15
0111 1111 = 64 + 32 + 16 + 8 + 4 + 2 + 1 = 127
1000 0000 = -128
1010 1010 = -128 + 32 + 8 + 2 = -86
1111 1111 = -128 + 64 + 32 + 16 + 8 + 4 + 2 + 1 = -1
So for positive integers up to 127, the bits are the same as those used for unsigned integers. But, when you flip the most significant bit to 1, the result goes all the way to the negative extreme of the range and counts back up. With signed integers, the value -1 is always represented by all 1's.
Any number with a 1 for the most significant bit will be negative. That's why it's sometimes referred to as the "sign bit."
@Darren,
Very interesting. I've never really looked at the sign stuff, but that was definitely different than the mental model that I had made up. I had sort of just assumed that the MSB was the negative sign; BUT, that nothing else changed. I pictured it like base10, where the only difference between 456 and -456 is the presence of the negative sign. I didn't realize that the logic of the set bits also changes.
That's super interesting. So, the bits always "add". It just depends on whether they adding to "0" or to "-128".
Am I correct in assuming that there is no way to look at a byte and tell whether or not it is signed? Right? It seems that can only be known from the context, not from the data. So, for example, if a Database has a schema that defines an unsigned INT, the only thing that changes is the schema definition and not the actual data.... if that question makes sense.
@All,
Also, as a follow-up, I was playing with this in an attempt to figure out how to implement Base32-encoding:
www.bennadel.com/blog/2687-using-base32-encoding-and-decoding-in-coldfusion.htm
In Base32, you have to split up the byte "stream" into chunks of 5-bit bytes which get encoded using a Base32 alphabet.
@Ben,
Yep, if you just have a set of bits in front of you and nothing else, there's no way to know what they represent without applying some sort of data type. An arbitrary set of 32 bits could be a 32-bit signed or unsigned integer, could be a floating point number, could be two 16-bit or four 8-bit integers, could be anywhere from 1-4 UTF-8 characters. It's all in how you interpret them. :)
The technical term for representing integers this way is "two's complement," and the main reason it's used is for consistency in arithmetic operations. With two's complement, the low-level addition, subtraction and multiplication operations remain the same whether the value is signed or unsigned, which does not hold true for the sign-magnitude system (using the MSB as a negate "flag"). The reasons behind that get a little complicated, but if you're interested in a math adventure, just Google it!
Also, fascinating stuff with your Base32 component. Thanks for sharing that. Coldfusion does not make bit splitting simple, but it's nice to know you can get it done!
@Darren,
It is nice that the math stays the same (less the sign bit). It's interesting to think about the implementation at this level - much different than I'm used to thinking. For the most part, I'm just happy when I can get UTF-8 characters to show up properly on a website :D
As far as bit-wise stuff in ColdFusion, I'm still learning. I've the bitAnd/bitOr functions before. But, there are even a number of bit functions that I don't understand (the ColdFusion documentation is sometimes very lacking). But, I might go down that rabbit hole for a bit.
That's the crazy thing about programming -- every hurdle is actually a rabbit hole :|
@Darren,
I have to say, your explanation really saved my bacon!
www.bennadel.com/blog/2689-creating-signed-java-byte-values-using-coldfusion-numbers.htm
I was totally stumped on how to create a Java byte, which is a signed value, using a ColdFusion number, which is 32-bits. If you hadn't told me how twos-compliment works, I would have been forever lost! Much thanks!
@Ben,
I'm so glad I could help! I rejoice any time bacon is saved. Mmmm...
--Darren
@Darren,
Ha ha, mmmm, bacon :)