Using IIS URL Rewriting And CGI.PATH_INFO With IIS MOD-Rewrite
Previously, I explored the concept of using URL rewriting with IIS and IIS MOD-Rewrite in order to make ColdFusion's OnMissingTemplate() event handler more effective. This worked fine with some finagling, but Justice suggested that I take a look at using PATH_INFO. I've only briefly looked into PATH_INFO before, but one thing that I do like about it is that it can be used both with and without URL rewriting. As a quick overview of PATH_INFO, if you have the following URL:
index.cfm/foo/bar/
... the value that comes after the index.cfm (/foo/bar/) is the extra PATH_INFO. Now, in ColdFusion on IIS (I say that since I cannot test it on other systems and I am told that it varies), CGI.SCRIPT_NAME and CGI.PATH_INFO are the same value unless the extra path information is provided. So, for example, if you go to the following URL:
index.cfm
... CGI will report the following values:
script_name: /index.cfm
path_info: /index.cfm
As you can see, when you request a script directly, the script name and the path info are the same. However, if you request the following URL:
index.cfm/foo/bar/
... then CGI will report the following values:
script_name: /index.cfm
path_info: /foo/bar/
As you can see now, the two values are different. While this is a bit odd, at least it gives us a way to determine when PATH_INFO is being used (when its value is different from the SCRIPT_NAME).
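To make that check concrete, here is a minimal sketch of it as a function. The name hasExtraPathInfo() is hypothetical (it is not part of ColdFusion), and the extra empty-string test is an assumption on my part for servers that report a blank PATH_INFO when no extra path is given; I have only observed the IIS behavior described above.

```cfml
<cffunction
    name="hasExtraPathInfo"
    access="public"
    returntype="boolean"
    output="false"
    hint="Hypothetical helper: I report whether extra path information was requested.">

    <cfargument name="scriptName" type="string" required="true" />
    <cfargument name="pathInfo" type="string" required="true" />

    <!---
        On IIS, PATH_INFO mirrors SCRIPT_NAME when nothing extra
        was requested. The len() check is an assumption covering
        servers that leave PATH_INFO empty instead.
    --->
    <cfreturn (
        len( arguments.pathInfo ) AND
        ( arguments.pathInfo neq arguments.scriptName )
    ) />
</cffunction>

<!--- Usage: true for "/index.cfm/foo/bar/", false for "/index.cfm". --->
<cfset usingPathInfo = hasExtraPathInfo( cgi.script_name, cgi.path_info ) />
```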
That said, approaching this URL rewriting with PATH_INFO in mind, I created the following IIS MOD-Rewrite configuration file:
# IIS Mod-Rewrite configuration file
# Turn on the rewrite rules for this access file. This will
# handle all requests based off of this directory.
RewriteEngine On
# If the given file or directory exists, then don't do any
# redirects - simply pass the request on to the file system.
RewriteCond %{REQUEST_FILENAME} -f [OR]
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule .? - [L]
# If the given file does not exist, then rewrite the request
# to use the front controller with the given file set as the
# path info for the script. We cannot make any assumptions
# about the file name as it might be purely a directory path
# without any file extension.
#
# NOTE: Because there is going to be a one-directory difference
# in browser path perception depending on whether or not the
# path info is entered manually, we have to flag that this was
# a rewrite for proper path calculations.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.+)$ index.cfm/$1?_rewrite [NC,L,QSA]
The first rule just states that if the given file or directory exists, let it pass through. That way, if we get past the first rule, then we know that the given file doesn't exist. At that point, we are going to rewrite the request to the application's front controller (index.cfm) and append the requested script as the PATH_INFO.
When we perform this rewrite, we need to tack on a URL flag indicating that the path info was created during a rewrite process. This is necessary because PATH_INFO can be added manually by the user directly in the URL. Meaning, the following requests can and must result in the same outcome:
/foo/bar/
/index.cfm/foo/bar/
In this case, the first would be handled by the rewrite engine and the second would be manually determined by the user. Now, if you look at the two URLs, you'll notice that the big difference between them is that the first path is two directories deep while the second path is three directories deep. Although the PATH_INFO values are the same, under the hood, we will need to provide two different relative web root values. This is why the rewrite engine needs to append a flag - so that our application framework understands how to create the web root.
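The depth difference can be sketched as a small stand-alone function. This is only an illustration under two assumptions: the front controller (index.cfm) sits in the application root, and the function name getRelativeWebRoot() and its arguments are hypothetical, not something defined elsewhere in this post.

```cfml
<cffunction
    name="getRelativeWebRoot"
    access="public"
    returntype="string"
    output="false"
    hint="Hypothetical helper: I build the ../ prefix the browser needs to reach the application root.">

    <cfargument name="pathInfo" type="string" required="true" />
    <cfargument name="wasRewritten" type="boolean" required="true" />

    <!---
        Count the levels the browser sees. Appending "-" makes a
        trailing slash count as one more level, since listLen()
        ignores empty list items.
    --->
    <cfset var depth = listLen( arguments.pathInfo & "-", "\/" ) />

    <!---
        In a rewritten request, the browser never saw the
        "index.cfm" segment, so it is one level shallower.
    --->
    <cfif arguments.wasRewritten>
        <cfset depth = depth - 1 />
    </cfif>

    <cfreturn repeatString( "../", max( depth, 0 ) ) />
</cffunction>

<!--- Browser requested /foo/bar/ (rewritten): returns "../../". --->
<cfset rewrittenRoot = getRelativeWebRoot( "/foo/bar/", true ) />

<!--- Browser requested /index.cfm/foo/bar/ (manual): returns "../../../". --->
<cfset manualRoot = getRelativeWebRoot( "/foo/bar/", false ) />
```

Both calls receive the same PATH_INFO value; only the flag changes the result, which is why the rewrite rule has to pass that flag along.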
With that in mind, let's take a quick look at the Application.cfc file:
Application.cfc
<cfcomponent
output="false"
hint="I define the application settings and event handlers.">
<!--- Define the application. --->
<cfset this.name = hash( getCurrentTemplatePath() ) />
<cfset this.applicationTimeout = createTimeSpan( 0, 0, 5, 0 ) />
<!--- Define page request settings. --->
<cfsetting
requesttimeout="10"
showdebugoutput="false"
/>
<cffunction
name="onApplicationStart"
access="public"
returntype="boolean"
output="false"
hint="I initialize the application.">
<!--- Define the local scope. --->
<cfset var local = {} />
<!---
As part of the application initialization, we
want to figure out some constants surrounding our
application location:
RootDirectory
The root directory of our application.
RootScript
The root script path of our application. This is in
the case where our application lives below the web
root of the server.
RootUrl
The root URL of our application.
--->
<!---
Determining the root path is easy - we always know
that it is this directory (the one containing the
Application.cfc component).
--->
<cfset application.rootDirectory = getDirectoryFromPath(
getCurrentTemplatePath()
) />
<!---
To find the Root Script, we have to do a bit more
calculation; we need to figure out the difference
in the length between the root directory and
requested directory and then subtract that depth
from the requested script.
--->
<!---
Start off with the current script directory as the
root directory.
--->
<cfset application.rootScript = getDirectoryFromPath(
cgi.script_name
) />
<!---
Comparing the expanded root script to the root
directory, we can now figure out how many directories
below the application root we are.
--->
<cfset local.scriptDepth = (
listLen( expandPath( application.rootScript ), "\/" ) -
listLen( application.rootDirectory, "\/" )
) />
<!---
Based on the script depth, we can now move up the path
the corresponding number of steps.
--->
<cfset application.rootScript = reReplace(
application.rootScript,
"([^\\/]+[\\/]){#local.scriptDepth#}$",
"",
"one"
) />
<!---
Now that we have our root script, we can easily find
our root URL. The only special case we need to worry
about is when the root script is "/". In that case,
we are in the root of the web directory and don't need
to append the script.
--->
<cfset application.rootUrl = (
"http://" &
cgi.server_name
) />
<!---
Check to see if we have a script name worth appending
to the URL.
--->
<cfif !reFind( "^[\\/]$", application.rootScript )>
<!--- Append root script to URL. --->
<cfset application.rootUrl &= application.rootScript />
</cfif>
<!--- Return true so the page request can process. --->
<cfreturn true />
</cffunction>
<cffunction
name="onRequestStart"
access="public"
returntype="boolean"
output="false"
hint="I initialize the page request.">
<!--- Define arguments. --->
<cfargument
name="script"
type="string"
required="true"
hint="I am the requested script name."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!---
Combine the form and url scopes into a common request
attributes collection so that we don't have to know
what scope a variable came from.
--->
<cfset request.attributes = duplicate( url ) />
<cfset structAppend( request.attributes, form ) />
<!---
Param the default action variable - this will be
what the front-controller (and sub-controllers) use
to figure out what scripts to execute.
--->
<cfparam
name="request.attributes.do"
type="string"
default=""
/>
<!---
Split the request variable into an array such that
we can examine the parts of it in front-controller
control flow. We are going to assume that the raw
action variable is a dot-delimited list of actions.
--->
<cfset request.do = listToArray(
request.attributes.do,
"."
) />
<!---
Now that we have our action variable set up and based
on the query string, let's check to see if the current
page request is actually a URL rewrite, as determined
by a PATH_INFO that is different from the SCRIPT_NAME
available in the CGI object (normally, these two are
the same value unless path info is used explicitly).
If so, we might have to translate the PATH_INFO into
a new action AND a set of query string parameters.
--->
<cfif (cgi.script_name neq cgi.path_info)>
<!---
Create a normalized version of the script as taken
from the path_info variable. Essentially, we are
removing any leading or trailing slashes.
--->
<cfset local.script = reReplace(
cgi.path_info,
"^[\\/]+|[\\/]+$",
"",
"all"
) />
<!---
Now that we have our script name normalized, let's
use some regular expression pattern matching to
see if we need to update our action variable or
any other URL variables.
--->
<cfif reFind( "^contact\b", local.script )>
<!--- Routing to contact section. --->
<cfset request.do = [ "contact" ] />
<cfelseif reFind( "^about\b", local.script )>
<!--- Routing to about section. --->
<cfset request.do = [ "about" ] />
<cfelseif reFind( "^blog/\d+", local.script )>
<!--- Routing to blog section. --->
<cfset request.do = [ "blog" ] />
<!--- Get ID of blog post. --->
<cfset request.attributes.id = listGetAt(
local.script,
2,
"/"
) />
<cfelseif reFind( "^blog\b", local.script )>
<!--- Routing to blog section. --->
<cfset request.do = [ "blog" ] />
<cfelse>
<!---
We could not match the requested URL against
any of our SES patterns. As such, this is
truly an invalid file request. As such, let's
return a true 404 error.
--->
<cfheader
statuscode="404"
statustext="Page Not Found"
/>
<!---
Returning false here would stop the requested page
from being processed. The return is commented out
below, so the request continues after the 404 header
is set.
--->
<!--- <cfreturn false /> --->
</cfif>
</cfif>
<!---
Get the relative web root path from our current page
(this will allow our traversal path to always be
relative, rather than a hard-coded root path, which
is the ultra lame... like really really lame).
When calculating this path, we need to take into
account BOTH the current file as well as any PATH_INFO
value since the browser views them both as adding to
the depth of the page request.
Get the initial web root based only on the requested
page template.
--->
<cfset request.webRoot = repeatString(
"../",
(
listLen( getDirectoryFromPath( expandPath( arguments.script ) ), "\\/" ) -
listLen( application.rootDirectory, "\\/" )
)) />
<!---
Now that we have the base web root from the requested
template, we have to see if there is any extra pathing
being used. As before, we will determine this to be
true if the script name and the path info are different
values.
--->
<cfif (cgi.script_name neq cgi.path_info)>
<!---
Because there will be a one directory difference
in browser perception if the PATH_INFO was entered
manually, versus if this was a rewrite, we have to
check for the rewrite flag.
--->
<cfif structKeyExists( url, "_rewrite" )>
<!---
There will be an offset required when using
the PATH_INFO for calculation.
--->
<cfset local.webRootOffset = 1 />
<!---
Delete the rewrite flag as it will not be
needed for anything else in this request.
--->
<cfset structDelete( url, "_rewrite" ) />
<cfset structDelete( request.attributes, "_rewrite" ) />
<cfelse>
<!---
The PATH_INFO was entered manually. As such,
there will be no offset needed for the web
root.
--->
<cfset local.webRootOffset = 0 />
</cfif>
<!---
We are using extra PATH_INFO. Add one "../" for each
level that the PATH_INFO contributes (the appended "-"
makes a trailing slash count as a level), minus the
rewrite offset determined above.
--->
<cfset request.webRoot &= repeatString(
"../",
(
listLen( (cgi.path_info & "-" ), "\/" ) -
local.webRootOffset
)) />
</cfif>
<!--- Return true so that the page can be processed. --->
<cfreturn true />
</cffunction>
<cffunction
name="onRequest"
access="public"
returntype="void"
output="true"
hint="I execute the page request.">
<!--- Define arguments. --->
<cfargument
name="script"
type="string"
required="true"
hint="I am the requested script name."
/>
<!--- Include the requested page. --->
<cfinclude template="#arguments.script#" />
<!--- Return out. --->
<cfreturn />
</cffunction>
</cfcomponent>
I won't go into too much detail since I've talked about this before AND I need to start working. The pattern matching works just as it did in my previous posts - the difference being that I'm pulling my information out of the CGI.PATH_INFO value. The real gotcha in this approach is that the relative web root becomes a bit more complicated. Not only can the extra pathing come from either URL rewriting or manual entry by the user, but the relative web root also needs to take into account both the requested script and the path_info. Hopefully, the code comments make it clear what I am doing.
In the end, I have to say that I rather like the PATH_INFO approach, if for no other reason than that it can be done without any URL rewriting at all. Thanks Justice for the suggestion.
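For anyone who wants to try the pattern matching outside of Application.cfc, here is a rough sketch of the same idea pulled into its own function. The name resolveRoute() and the shape of the struct it returns are hypothetical; the patterns simply mirror the contact / about / blog routes used above.

```cfml
<cffunction
    name="resolveRoute"
    access="public"
    returntype="struct"
    output="false"
    hint="Hypothetical helper: I translate a PATH_INFO value into an action and an optional ID.">

    <cfargument name="pathInfo" type="string" required="true" />

    <cfset var result = structNew() />

    <!--- Strip leading and trailing slashes, as onRequestStart() does. --->
    <cfset var script = reReplace(
        arguments.pathInfo,
        "^[\\/]+|[\\/]+$",
        "",
        "all"
        ) />

    <cfset result.matched = true />
    <cfset result.action = "" />
    <cfset result.id = "" />

    <cfif reFind( "^blog/\d+", script )>
        <!--- A blog post: the second path segment is the post ID. --->
        <cfset result.action = "blog" />
        <cfset result.id = listGetAt( script, 2, "/" ) />
    <cfelseif reFind( "^(contact|about|blog)\b", script )>
        <!--- A top-level section: keep only the section name. --->
        <cfset result.action = reReplace(
            script,
            "^(contact|about|blog)\b.*$",
            "\1"
            ) />
    <cfelse>
        <!--- Nothing matched; the caller should send a 404. --->
        <cfset result.matched = false />
    </cfif>

    <cfreturn result />
</cffunction>

<!--- Usage: action is "blog" and id is "1744". --->
<cfset route = resolveRoute( "/blog/1744/" ) />
```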
Reader Comments
Excellent post, Ben.
Very interesting to see the built-in handler methods used to help control the rewrite rules.
Time to have a play with this myself :)
@Matt,
Yeah, I think I might want to convert my blog over to using this approach. I use a similar technique, but all 404 powered (as thrown by IIS). This seems much cleaner - and that I can simulate it *without* any rewriting as well (just straight up path_info) is rather awesome.
@Ben,
Thanks for the shoutout! And good job getting this kind of scenario up and running, and then writing a good post about it.
I see you decided to try to tackle a thorny problem that I was thinking about too for a while. In the end, I decided just to sidestep the whole problem. This is the approach I took:
Any path that matches with '/media/.*' gets passed through unaltered. Obviously, that means you put all CSS/JavaScript/images somewhere in a folder 'media' which is itself directly in the webroot.
Next, *rewrite every single path* from '/foo/bar' to '/index.cfm/foo/bar'. No questions asked. Rewrite will always be performed. No more wondering whether this is *really* a rewrite. (Recall that all of the static file paths already got matched by the previous rule and don't get rewritten by this rule.)
To create a more portable app, you would actually want to support the following two scenarios: *no* paths are rewritten throughout the whole application, and *all* paths are rewritten throughout the whole application (except for paths in the '/media' folder). But this setting can be set statically somewhere, depending on how the app is deployed, and it is a global setting, rather than a per-page-request setting.
I wouldn't mix-and-match rewriting and non-rewriting within the same app and within the same deployment of that app, because that tends to make things ... complicated. My approach certainly seems more restrictive at the outset, but I think that it trades away the ability to do something you don't really really need to do anyway for a little bit of extra simplicity in your life. Supporting both '/foo/bar' and '/index.cfm/foo/bar' at the same time is tricky for the reasons you mentioned (what if it's just '/index.cfm', what about relative paths, etc.), but in the end I don't think you really need to go through all that complexity because I think that should be a global, app-wide setting, not a per-page-request setting.
Happy coding!
Cheers,
Justice
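The global, app-wide setting described in the comment above might look something like the following sketch. The function name buildUrl() and the useRewriting argument are assumptions for illustration; in practice the flag would probably be read from a deployment setting rather than passed in on each call.

```cfml
<cffunction
    name="buildUrl"
    access="public"
    returntype="string"
    output="false"
    hint="Hypothetical helper: I build a site-root link for a path, with or without rewriting.">

    <cfargument name="path" type="string" required="true" />
    <cfargument name="useRewriting" type="boolean" required="true" />

    <!--- Normalize the path so it has exactly one leading slash. --->
    <cfset var cleanPath = "/" & reReplace(
        arguments.path,
        "^[\\/]+",
        "",
        "one"
        ) />

    <!--- With rewriting on, the front controller stays hidden. --->
    <cfif arguments.useRewriting>
        <cfreturn cleanPath />
    </cfif>

    <!--- Without rewriting, the path rides along as PATH_INFO. --->
    <cfreturn "/index.cfm" & cleanPath />
</cffunction>

<!--- Returns "/foo/bar" when rewriting is on for the deployment. --->
<cfset rewrittenLink = buildUrl( "foo/bar", true ) />

<!--- Returns "/index.cfm/foo/bar" when it is off. --->
<cfset plainLink = buildUrl( "foo/bar", false ) />
```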
@Justice,
The one thing, though, that I can't quite figure out is how to reconcile what the web browser sees with what the app engine sees. After all, the rewriting happens at the server level, not the browser level (unless you do a hard redirect). As such, even if I were to rewrite *all* the URLs, the browser would still be fooled by its own path_info.
For example, let's say someone does type in:
index.cfm/foo/bar/
... even if I rewrite on the server, the browser still sees the user as 3 levels deep, which will necessitate a web root of "../../../".
Even with a rewrite-everything rule, how do you deal with that?
@Ben,
The web browser should always see '/foo/bar'. All of your URLs in the HTML should say '/foo/bar'. If you have a link in your HTML to, say '../diz', then that would be resolved to '/foo/bar/../diz' = '/foo/diz'. Behind the scenes, there would be an '/index.cfm' involved in processing everything, but that is invisible to the browser.
I typically like to use absolute paths in my links. So in the HTML for '/foo/bar' I might have a link to '/foo/diz' rather than a link to '../diz'.
It should be impossible for the user to access '/index.cfm' directly. In other words, if the user typed in '/index.cfm', then the rewrite engine would kick in and then ColdFusion would actually see '/index.cfm/index.cfm'. Likewise, if the user typed in '/default.php', then IIS would do an internal redirect and would see '/index.cfm/default.php' and would send the request to CF instead of PHP. Likewise, if the user types in '/index.cfm/foo/bar' then IIS does an internal redirect and calls the '/index.cfm' template with a PATH_INFO of '/index.cfm/foo/bar'.
Ultimately, *no* templates are directly accessible from the browser. Of course if someone tries to access '/foo/bar/baz.cfm' from the browser, then IIS does an internal rewrite to '/index.cfm/foo/bar/baz.cfm' and then your '/index.cfm' file is allowed to check if the PATH_INFO value of '/foo/bar/baz.cfm' exists as a real path on the server and then cfinclude that template. I actually consider this to be a benefit.
Cheers
Justice
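The last idea in the comment above (letting the front controller include a real template named by PATH_INFO) could be sketched roughly like this. It assumes the code lives in index.cfm at the application root and that only .cfm files should ever be included; the variable names are hypothetical, and a real application would likely want a stricter whitelist than this.

```cfml
<!--- Turn "/foo/bar/baz.cfm" into a path relative to this template. --->
<cfset relativePath = reReplace( cgi.path_info, "^[\\/]+", "", "one" ) />

<!--- Resolve it against the directory that holds index.cfm. --->
<cfset absolutePath = (
    getDirectoryFromPath( getCurrentTemplatePath() ) &
    relativePath
    ) />

<cfif (
    reFindNoCase( "\.cfm$", relativePath ) AND
    NOT find( "..", relativePath ) AND
    ( relativePath neq "index.cfm" ) AND
    fileExists( absolutePath )
    )>

    <!--- The path is a real template below the root; run it. --->
    <cfinclude template="#relativePath#" />

<cfelse>

    <!--- Not a template we are willing to include. --->
    <cfheader statuscode="404" statustext="Not Found" />

</cfif>
```

The ".." and "index.cfm" checks are there to keep the include from climbing out of the application or calling the front controller recursively.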
@Justice,
OK, I see what you're saying. So you rewrite every file, whether it exists or not. I hadn't thought of that. I think I was in the world of physical files for so long that it never occurred to me to rewrite requests for files that exist (only for ones that don't).
I think there is a certain amount of sense to what you are doing - everything goes through the front controller, whether you like it or not (outside your Media folder of course).
My only concern with this approach would be that you'd have to really work that angle from the get-go, otherwise, you could quite easily cripple a site. My blog, for example, has all kinds of random things on it (presentations, demos, sample apps) that exist outside the main framework. As such, there is no logic in my front controller that knows how to handle that.
Of course, I could always just add additional exceptions for that (such as nothing in the "resources" folder gets re-written).
@Ben,
Exactly. In my head, rewriting is the rule, and the /media/ and /resources/ dirs are the exception.
One thing that you can easily do within a pre-existing app is to have a special folder like /go/ where only stuff like '/go/foo/bar' gets rewritten to '/go/index.cfm/foo/bar'. So that makes the rewriting the exception and leave-it-alone the rule.
So, purely as an example for your blog, instead of browsing to '/index.cfm?dax=blog:1744.view', we would browse to '/blog/posts/1744'. This would get rewritten internally to '/blog/index.cfm/posts/1744' (and that path would be inaccessible from the browser). But only paths that already start with '/blog/' would get rewritten in this way - all other paths would be left alone, leaving the way most of your site works intact.
Cheers,
Justice
@Justice,
Thanks for the clarification. I suppose that makes sense; I always default to building some sort of flexibility into my code. But, I think that desire is completely arbitrary - not dictated by an actual need.
Any help would be awesome... So I'm using mod-rewrite and need to use link rel="canonical"
the problem is that ColdFusion doesn't read the rewrite and I can't figure out a way to place the new rewrite into the link rel.
<link rel="canonical" href="http://www.example.com/exampledir/example.html">