Detecting File Type Using Magic Numbers In ColdFusion
At InVision, we get a lot of users who upload files with an incorrect file extension. Typically, when this happens, a user has uploaded a ".jpg" file that is actually a Photoshop (PSD) file under the hood. To help deal with this problem, I want to build in some more intelligent error reporting; so, I started looking into detecting file type using a file's "magic number."
According to Wikipedia, a "magic number" is a sequence of bytes that is embedded within a file in an attempt to uniquely identify the file type:
One way to incorporate file type metadata, often associated with Unix and its derivatives, is just to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method.
I don't know much more about it than this, so I won't offer any additional explanation. All I present below is my attempt to explore file-type identification by reading in the leading bytes of a given file and comparing them to a set of known hexadecimal signatures.
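To make the link between the ASCII markers in the quote and the hexadecimal signatures used below more concrete, here is a minimal sketch in plain Java (which ColdFusion can call, since it runs on the JVM). The class and method names are hypothetical and exist only for illustration; the sketch simply encodes an ASCII marker as two hex characters per byte.

```java
import java.nio.charset.StandardCharsets;

final class SignatureHex {

    // Encode raw bytes as an uppercase hex string (two characters per byte).
    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    // Hex form of an ASCII marker such as "GIF89a".
    static String asciiToHex(String marker) {
        return toHex(marker.getBytes(StandardCharsets.US_ASCII));
    }
}
```

Calling `SignatureHex.asciiToHex("GIF89a")` yields `474946383961`, which is the same string that appears as one of the GIF signatures in the test data.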
The signatures I use in my example were collected from a few different sources:
- List of file signatures
- Magic Numbers in Files
- File Signatures Table
- File Magic Number
- File Signatures Database
As you can see in the following code, a given file type may have several "known" magic number signatures. Furthermore, a few different file types actually have the same magic number signature:
<cfscript>
    // I determine if the given file starts with the given hex
    // signature. We can use this to loosely determine the file type
    // of the given file.
    boolean function startsWithHexSignature(
        required string filepath,
        required string hexSignature,
        numeric signatureOffset = 0
        ) {
        // The byte signature is half the length of the hex signature
        // since a single byte is represented by two hex characters.
        var signatureLength = ( len( hexSignature ) / 2 );
        // I am the byte buffer into which bytes will be streamed.
        var byteBuffer = [];
        arraySet( byteBuffer, 1, signatureLength, 0 );
        // Convert the ColdFusion array to a Java byte array so that
        // the input stream will work with it.
        byteBuffer = javaCast( "byte[]", byteBuffer );
        // Open an input stream to the file.
        var inputStream = createObject( "java", "java.io.FileInputStream" ).init(
            javaCast( "string", filepath )
        );
        try {
            // If the signature is offset, read off the bytes that
            // we want to ignore.
            if ( signatureOffset ) {
                for ( var i = 0 ; i < signatureOffset ; i++ ) {
                    inputStream.read();
                }
            }
            // Read off the signature bytes. If the file is too short
            // to hold the whole signature, it cannot be a match (and
            // the untouched zeros in the buffer must not count).
            var bytesRead = inputStream.read( byteBuffer );
            return(
                ( bytesRead == signatureLength ) &&
                ( binaryEncode( byteBuffer, "hex" ) == hexSignature )
            );
        } catch ( any ioError ) {
            // If anything went wrong with the input stream read,
            // then we'll assume the signatures don't match.
            return( false );
        } finally {
            // No matter what happens, clean up the file.
            inputStream.close();
        }
    }
// ------------------------------------------------------ //
// ------------------------------------------------------ //
// ------------------------------------------------------ //
// ------------------------------------------------------ //
// Set up some test files and known hex signatures for the given
// files. NOTE: I gathered these signatures across different
// sites which is why there may be multiple tests for a given
// file type.
tests = [
{
filename = "file.ai",
hexSignature = "25504446",
offset = 0
},
{
filename = "file.avi",
hexSignature = "52494646",
offset = 0
},
{
filename = "file.avi",
hexSignature = "415649204C495354",
offset = 0
},
{
filename = "file.gif",
hexSignature = "474946383761",
offset = 0
},
{
filename = "file.gif",
hexSignature = "474946383961",
offset = 0
},
{
filename = "file.jpg",
hexSignature = "FFD8FFE0",
offset = 0
},
{
filename = "file.jpg",
hexSignature = "494600",
offset = 0
},
{
filename = "file.jpg",
hexSignature = "FFD8FFE1",
offset = 0
},
{
filename = "file.mov",
hexSignature = "0000001466747970",
offset = 0
},
{
filename = "file.mov",
hexSignature = "6D6F6F76",
offset = 0
},
{
filename = "file.mov",
hexSignature = "71742020",
offset = 4
},
{
filename = "file.mov",
hexSignature = "66726565",
offset = 4
},
{
filename = "file.mov",
hexSignature = "6D646174",
offset = 4
},
{
filename = "file.mov",
hexSignature = "77696465",
offset = 4
},
{
filename = "file.mov",
hexSignature = "706E6F74",
offset = 4
},
{
filename = "file.mov",
hexSignature = "736B6970",
offset = 4
},
{
filename = "file.mp3",
hexSignature = "FFFB",
offset = 0
},
{
filename = "file.mp3",
hexSignature = "494433",
offset = 0
},
{
filename = "file.mp4",
hexSignature = "0000001866747970",
offset = 0
},
{
filename = "file.mp4",
hexSignature = "33677035",
offset = 0
},
{
filename = "file.pdf",
hexSignature = "25504446", // <-- NOTE: Same as AI.
offset = 0
},
{
filename = "file.png",
hexSignature = "89504E470D0A1A0A",
offset = 0
},
{
filename = "file.psd",
hexSignature = "4F676753",
offset = 0
},
{
filename = "file.psd",
hexSignature = "38425053",
offset = 0
},
{
filename = "file.xlsx",
hexSignature = "504B0304", // <-- NOTE: Same as ZIP.
offset = 0
},
{
filename = "file.zip",
hexSignature = "504B0304",
offset = 0
},
{
filename = "file.zip",
hexSignature = "504B0506",
offset = 0
},
{
filename = "file.zip",
hexSignature = "504B0708",
offset = 0
},
{
filename = "file.zip",
hexSignature = "504B4C495445",
offset = 30
},
{
filename = "file.zip",
hexSignature = "504B537058",
offset = 526
},
{
filename = "file.zip",
hexSignature = "57696E5A6970",
offset = 29
},
{
filename = "file.zip",
hexSignature = "57696E5A6970",
offset = 152
},
{
filename = "file.zip",
hexSignature = "1F8B08",
offset = 0
}
];
    // Loop over each test and test the hex signature.
    for ( test in tests ) {
        writeOutput( "#test.filename# [ #test.hexSignature# ] : " );
        writeOutput(
            startsWithHexSignature(
                expandPath( "./tests/" & test.filename ),
                test.hexSignature,
                test.offset
            )
        );
        writeOutput( "<br />" );
    }
</cfscript>
When I run the code, I get the following page output:
file.ai [ 25504446 ] : YES
file.avi [ 52494646 ] : YES
file.avi [ 415649204C495354 ] : NO
file.gif [ 474946383761 ] : NO
file.gif [ 474946383961 ] : YES
file.jpg [ FFD8FFE0 ] : NO
file.jpg [ 494600 ] : NO
file.jpg [ FFD8FFE1 ] : YES
file.mov [ 0000001466747970 ] : YES
file.mov [ 6D6F6F76 ] : NO
file.mov [ 71742020 ] : NO
file.mov [ 66726565 ] : NO
file.mov [ 6D646174 ] : NO
file.mov [ 77696465 ] : NO
file.mov [ 706E6F74 ] : NO
file.mov [ 736B6970 ] : NO
file.mp3 [ FFFB ] : NO
file.mp3 [ 494433 ] : YES
file.mp4 [ 0000001866747970 ] : YES
file.mp4 [ 33677035 ] : NO
file.pdf [ 25504446 ] : YES
file.png [ 89504E470D0A1A0A ] : YES
file.psd [ 4F676753 ] : NO
file.psd [ 38425053 ] : YES
file.xlsx [ 504B0304 ] : YES
file.zip [ 504B0304 ] : YES
file.zip [ 504B0506 ] : NO
file.zip [ 504B0708 ] : NO
file.zip [ 504B4C495445 ] : NO
file.zip [ 504B537058 ] : NO
file.zip [ 57696E5A6970 ] : NO
file.zip [ 57696E5A6970 ] : NO
file.zip [ 1F8B08 ] : NO
I was able to find a matching signature for each of the given file types above; but for file types that have multiple known signatures, only one of them matched my particular test file. I probably wouldn't use this kind of approach to make critical decisions; but, I'll definitely look at using this approach as part of intelligent error reporting so that I can gain more insight into why certain files can't be processed.
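As a rough sketch of how this could feed into error reporting, the following plain Java class (callable from ColdFusion on the JVM) checks a file against a small table of signatures and reports when the claimed extension does not fit any matching type. Everything here is an assumption made for illustration: the class name, the method names, the wording of the message, and the choice of signatures, which is only a subset of the ones tested above. Because several types share a signature, the lookup returns a set of candidates rather than a single answer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SignatureSniffer {

    // One known signature: a type label, the expected bytes, and their offset.
    static final class Signature {
        final String type;
        final byte[] bytes;
        final long offset;

        Signature(String type, String hex, long offset) {
            this.type = type;
            this.bytes = fromHex(hex);
            this.offset = offset;
        }
    }

    // Illustrative subset of the signatures from the test data above.
    static final List<Signature> KNOWN = Arrays.asList(
        new Signature("gif", "474946383761", 0),
        new Signature("gif", "474946383961", 0),
        new Signature("jpg", "FFD8FFE0", 0),
        new Signature("jpg", "FFD8FFE1", 0),
        new Signature("pdf", "25504446", 0),
        new Signature("ai", "25504446", 0),
        new Signature("png", "89504E470D0A1A0A", 0),
        new Signature("psd", "38425053", 0),
        new Signature("zip", "504B0304", 0),
        new Signature("xlsx", "504B0304", 0)
    );

    // Decode a hex string (two characters per byte) into bytes.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(i * 2, i * 2 + 2), 16);
        }
        return out;
    }

    // True only if the file holds exactly these bytes at the given offset.
    static boolean matches(RandomAccessFile file, Signature sig) throws IOException {
        if (file.length() < sig.offset + sig.bytes.length) {
            return false; // File is too short to contain the signature.
        }
        byte[] actual = new byte[sig.bytes.length];
        file.seek(sig.offset);
        file.readFully(actual); // Fills the whole buffer or throws.
        return Arrays.equals(actual, sig.bytes);
    }

    // Every type whose signature matches; more than one is possible.
    static Set<String> candidateTypes(String path) throws IOException {
        Set<String> types = new LinkedHashSet<>();
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            for (Signature sig : KNOWN) {
                if (matches(file, sig)) {
                    types.add(sig.type);
                }
            }
        }
        return types;
    }

    // A message when content and claimed extension disagree, otherwise null.
    static String describeMismatch(String path, String claimedExtension) throws IOException {
        Set<String> types = candidateTypes(path);
        if (types.isEmpty() || types.contains(claimedExtension.toLowerCase(Locale.ROOT))) {
            return null; // Unknown content, or the extension is plausible.
        }
        return "Uploaded as ." + claimedExtension + " but content looks like " + types;
    }
}
```

A PSD file renamed to ".jpg" would produce a message listing "psd" as the candidate, which is the kind of hint that could be logged next to a failed image-processing job. An empty result means only that none of the listed signatures matched, not that the file is invalid.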
Reader Comments
Thanks for the nice summary.
You might also want to take a closer look at TrID (or the stand-alone DLL behind it), which does various checks to return the file types with the highest probability:
http://mark0.net/soft-trid-e.html
@Michael,
Thanks for the link - looks like a cool utility. I like that it gives you percentage confidence that a given file is of a given type. On the flip side, it saddens me that files can be relatively ambiguous. I'll definitely look into it more.
There will always be a certain level of multiple candidates in certain situations (e.g., MP3s containing ID3 data of different versions or old files from back when DivX and Xvid split into two separate codec projects and the differences were still minor).
@Michael,
Word. That's why I'll probably use this detection in response to an event, rather than as a means to define which event takes place.
You may already be aware of this, but I wanted to mention it just in case:
In ColdFusion 10, if you specify an "accept" attribute on a <cffile action="upload"> tag and use a list of MIME types instead of file extensions, the tag will actually look at the "first few bytes" of uploaded files to determine the MIME type and use that to determine whether to allow the upload. I assume it uses some implementation of magic numbers internally.
For more info, see the Coldfusion Reference page for cffile (scroll down a bit to the "Modifications to the attribute accept in ColdFusion 10" section):
http://help.adobe.com/en_US/ColdFusion/10.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7fa1.html
The
help page has not been updated with the new behavior... that's where I would have put it!
BTW: Ben you actually provided an even better example yourself.
Files produced by modern versions of MS Office and OpenOffice will also match the magic number for ZIP files because they're just that, renamed ZIP archives containing a ton of XML files and related junk.
That's why file type detection tools need to dig deeper.
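As a rough illustration of what "digging deeper" could look like for ZIP-based formats, here is a minimal Java sketch that opens the archive and looks for well-known entry names. The class name, method name, and returned labels are hypothetical. The entry names are the main parts that Office Open XML files normally contain, and the "mimetype" entry is what OpenDocument (and EPUB) packages use; treat the result as a hint, since any ZIP archive could contain entries with these names.

```java
import java.io.IOException;
import java.util.zip.ZipFile;

final class ZipContainerSniffer {

    // Guess the container type of a ZIP-based file from its entry names.
    // Throws an IOException (ZipException) if the file is not a ZIP archive.
    static String classify(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            if (zip.getEntry("xl/workbook.xml") != null) {
                return "xlsx";
            }
            if (zip.getEntry("word/document.xml") != null) {
                return "docx";
            }
            if (zip.getEntry("ppt/presentation.xml") != null) {
                return "pptx";
            }
            if (zip.getEntry("mimetype") != null) {
                // OpenDocument and EPUB both use this entry; reading its
                // content would be needed to tell them apart.
                return "odf-or-epub";
            }
            return "zip";
        }
    }
}
```

With something like this, a file that matches the `504B0304` signature could be reported as "xlsx" rather than just "zip" when the workbook part is present.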
Always enjoy reading your "exploratory" posts :)
Was wondering if you could do any of this validation client-side (with FileReader) to make this determination. "Hey you're trying to upload a PSD, but it looks like a JPG"
@Ben,
Would you please explain Darren's point of view?
@Muhammad,
I was just highlighting a new feature in Coldfusion 10 that (at least to me) was not promoted very well when the new version was released and is barely mentioned in the documentation, but in my opinion is a significant improvement. I think in this post, Ben is just exploring the "code behind" magic number file type detection, which I thought was pretty fascinating!
When you process a form upload with <cffile action="upload"> it takes an optional "accept" attribute. Before CF10 you could only provide a comma-separated list of file extensions, and Coldfusion just checked for an extension match; it did not look into the contents of the file. In CF10, if you use full MIME types for the accept attribute, CF will actually read the "first few bytes" of each uploaded file and determine whether it matches an allowed MIME type.
So for example, if you have <cffile action="upload" accept="application/xml">, ColdFusion will actually look inside uploaded files to verify that they are XML files. If a user uploads a file that does not begin with '<?xml' (the XML declaration that XML files typically start with, though the XML specification makes it optional) ColdFusion will reject the upload regardless of the file extension.
A full list of MIME types Coldfusion can validate is not in the documentation and I wasn't able to find one. I know from experience that common image MIME types (image/gif, image/jpeg, image/png) and common document types (application/pdf, application/msword, application/vnd.openxmlformats-officedocument.wordprocessingml.document) work quite reliably; Coldfusion detects these MIME types correctly regardless of the file extension. Might be interesting to explore more of which MIME types work and which don't, and how good the detection is...
This blog post by Sagar Ganatra, former Adobe engineer, has some additional information about the 'accept' and 'strict' attributes: http://www.sagarganatra.com/2012/03/coldfusion-10-cffile-restricting-file.html
@Darren,
Very cool - I wasn't aware of the new functionality. Unfortunately, I'm currently on CF9.0.1 in production :( Also, one nice thing about going into the binary manually is that we can mess with the .TMP file that ColdFusion creates and do some detection there before actually calling the CFFile[upload].
@Michael,
Oh dang, I didn't realize that was what was going on. I just figured they used the same byte signature - not that they were actually the same type of file under the hood. You make a good point - true file detection (like the DLL you mentioned) really does need to dig deeper into the data that is available.
@Gareth,
Good question about client-side validation. Can't say that I've ever tried it. Currently, my client-side upload is all handled in a somewhat "blackbox" manner using the Plupload uploader library:
www.bennadel.com/blog/2417-Using-Plupload-For-Drag-Drop-File-Uploads-In-ColdFusion.htm
They have some filtering (but I think only based on file extensions). I honestly know nothing about reading actual bytes / binary data on the client.
Didn't know it worked that way! I got some weird Java class def errors today that also mentioned magic numbers, so that makes some more sense now.
I just wanted to mention also that you could use Files.probeContentType: http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html#probeContentType(java.nio.file.Path)
You need Java 1.7 BTW, it's been out for quite a while but on my machine I still had to manually upgrade. I'm curious about the relative usage of 1.7 compared to 1.6. Wouldn't be surprised if the majority is still at 1.6.
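For reference, a minimal sketch of calling `Files.probeContentType` looks like the following. The wrapper class and the "unknown" fallback are hypothetical. One caveat worth stating: the Javadoc leaves the detection strategy to the installed file type detectors, and the result is platform-dependent, so the default may be driven by the file name or operating-system settings rather than by the file's bytes. It can also return null when the type cannot be determined.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

final class ContentTypeProbe {

    // Returns the probed MIME type, or "unknown" when no installed
    // detector can determine a type for the file.
    static String probe(String path) throws IOException {
        String type = Files.probeContentType(Paths.get(path));
        return (type != null) ? type : "unknown";
    }
}
```

Because of that platform dependence, it seems safer to treat this as a complement to a signature check rather than a replacement for one.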
@Jeroen,
Very cool - I had no idea that was available. Honestly, I'm not even sure which version of Java I have running on my work machine. I think it might be 1.6 - not sure I ever upgraded. Although, now that I upgraded to Mountain Lion, I may have had to install 1.7. I'll have to check it out later - thanks for the tip!
@Gareth - I'm actually starting work on a no-dependency MIT-licensed tool that aims to efficiently and accurately identify files client-side. You can watch progress at https://github.com/rnicholus/determinater.
@Gareth - Hmm, looks like the period at the end of my sentence was included in the generated anchor link's href. Here it is again: https://github.com/rnicholus/determinater .
@Ray,
Sounds like a cool project. I know almost nothing about reading files on the client-side of things. Sounds really cool!