ColdFusion Session Management Revisited... User vs. Spider III
I have blogged numerous times about session management in ColdFusion; more specifically, about Michael Dinowitz's suggestion of turning off session management for web spiders so as to cut down on the number of client variables that we end up storing. Web spiders do not accept cookies and therefore (on my site) they create a new ColdFusion session for each page they hit. Spidering a few hundred pages in a matter of minutes can cause variable creation to skyrocket.
Previously, I was testing for the type of user based on CGI.http_user_agent. If it was a suspect type of user agent, I turned off session management. The code looked something like this:
// Define the application. To stop unnecessary memory usage, we are going to give
// web crawler no session management. This way, they don't have to worry about cookie
// acceptance and object persistence (except for APPLICATION scope). Here, we are
// using short-circuit evaluation on the IF statement with the most popular search
// engines at the top of the list. This will help us minimize the amount of time that
// it takes to evaluate the list.
// Create a lowercase version of the user agent so we can run without NoCase checks.
strTempUserAgent = LCase( CGI.http_user_agent );
// Check user agent.
if (
(NOT Len(strTempUserAgent)) OR
// We are gonna try to optimize even a little bit more. A good number of the spider
// names end in "bot". If we check for names that have BOT ending on a word boundary,
// we can eliminate several of the other spider checks. The bot\b search here takes
// care of the spiders that are now commented out below. As you can see, it takes
// the place of 18 different spider Find()'s.
REFind( "bot\b", strTempUserAgent ) OR
// This will try to get any RSS feed readers.
REFind( "\brss", strTempUserAgent ) OR
Find( "slurp", strTempUserAgent ) OR
Find( "mediapartners-google", strTempUserAgent ) OR
Find( "zyborg", strTempUserAgent ) OR
Find( "emonitor", strTempUserAgent ) OR
Find( "jeeves", strTempUserAgent ) OR
Find( "sbider", strTempUserAgent ) OR
Find( "findlinks", strTempUserAgent ) OR
Find( "yahooseeker", strTempUserAgent ) OR
Find( "mmcrawler", strTempUserAgent ) OR
Find( "jbrowser", strTempUserAgent ) OR
Find( "java", strTempUserAgent ) OR
Find( "pmafind", strTempUserAgent ) OR
Find( "blogbeat", strTempUserAgent ) OR
Find( "converacrawler", strTempUserAgent ) OR
Find( "ocelli", strTempUserAgent ) OR
Find( "labhoo", strTempUserAgent ) OR
Find( "validator", strTempUserAgent ) OR
Find( "sproose", strTempUserAgent ) OR
Find( "ia_archiver", strTempUserAgent ) OR
Find( "larbin", strTempUserAgent ) OR
Find( "psycheclone", strTempUserAgent ) OR
Find( "arachmo", strTempUserAgent )
// I am no longer checking for the following as they are being
// checked in the regular expression at the top.
// Find( "turnitinbot", strTempUserAgent ) OR
// Find( "ziggsbot", strTempUserAgent ) OR
// Find( "rufusbot", strTempUserAgent ) OR
// Find( "researchbot", strTempUserAgent ) OR
// Find( "ip2mapbot", strTempUserAgent ) OR
// Find( "gigabot", strTempUserAgent ) OR
// Find( "exabot", strTempUserAgent ) OR
// Find( "mj12bot", strTempUserAgent ) OR
// Find( "outfoxbot", strTempUserAgent ) OR
// Find( "obot", strTempUserAgent ) OR
// Find( "snapbot", strTempUserAgent ) OR
// Find( "myfamilybot", strTempUserAgent ) OR
// Find( "girafabot", strTempUserAgent ) OR
// Find( "aipbot", strTempUserAgent ) OR
// Find( "googlebot", strTempUserAgent ) OR
// Find( "becomebot", strTempUserAgent ) OR
// Find( "msnbot", strTempUserAgent ) OR
// Find( "irlbot", strTempUserAgent ) OR
// Find( "baiduspider", strTempUserAgent )
){
// This application definition is for robots that do NOT need sessions.
THIS.Name = "KinkySolutions v.1 {dev}";
THIS.SessionManagement = false;
THIS.SetClientCookies = false;
THIS.ClientManagement = false;
THIS.SetDomainCookies = false;
// Set the flag for session use.
REQUEST.HasSessionScope = false;
} else {
// This application is for the standard user.
THIS.Name = "KinkySolutions v.1 {dev}";
THIS.SessionManagement = true;
THIS.SetClientCookies = true;
THIS.SessionTimeout = CreateTimeSpan(0, 0, 20, 0);
THIS.LoginStorage = "SESSION";
// Set the flag for session use.
REQUEST.HasSessionScope = true;
}
This has been working great. The problem is that as I add different web crawlers, the number of pre-page-processing steps I have to do increases. Not only that, I am finding that some visitors are slipping through the user agent test; I need to start "black listing" certain IP addresses as being spiders even though they have normal user agents. That means the first if() statement can get bigger and bigger. I don't care about the web crawler experience (how long the pages take to load), but I do care about the user's experience, and right now, the user is getting the raw end of the deal: their normal page load is the worst-case scenario for user agent checking (no user agent gets matched, so every check in the list runs).
To help overcome this problem, I am using cookies to prevent the long IF statement from being executed more than once. I start out paraming these cookie values at the top of Application.cfc:
<cftry>
<cfparam name="COOKIE.SessionScopeTested" type="numeric" default="0" />
<cfparam name="COOKIE.HasSessionScope" type="numeric" default="0" />
<cfcatch>
<cfset COOKIE.SessionScopeTested = 0 />
<cfset COOKIE.HasSessionScope = 0 />
</cfcatch>
</cftry>
COOKIE.SessionScopeTested flags whether or not we have made the check for session management. COOKIE.HasSessionScope flags whether or not the user's page request has session management. Setting cookies in this way does NOT commit them to the user's cookie files, but it does create a temporary cookie value for the page request.
After CFParam'ing the cookie values, I also set a flag for whether or not the session management test was done (the big IF statement):
<!---
This is a flag to see if the cookie test was performed. Will help us determine
if we need to commit a Cookie value later on in processing.
--->
<cfset blnCookieTestPerformed = false />
The updated IF statement for session management checking now looks like:
// Define the application. To stop unnecessary memory usage, we are going
// to give web crawler no session management. This way, they don't have
// to worry about cookie acceptance and object persistence (except for
// APPLICATION scope). Here, we are using short-circuit evaluation on the
// IF statement with the most popular search engines at the top of the
// list. This will help us minimize the amount of time that it takes to
// evaluate the list. Create a lowercase version of the user agent so we
// can run without NoCase checks.
strTempUserAgent = LCase( CGI.http_user_agent );
// Check user agent.
if (
(NOT Len(strTempUserAgent)) OR
// We are testing the cookie values so that we are not duplicating
// logic. This should provide a performance increase for anyone
// accepting cookies.
(
COOKIE.SessionScopeTested AND
(NOT COOKIE.HasSessionScope)
) OR
// We are gonna try to optimize even a little bit more. A good number
// of the spider names end in "bot". If we check for names that have
// BOT ending on a word boundary, we can eliminate several of the other
// spider checks. The bot\b search here takes care of the spiders that
// are now commented out below. As you can see, it takes the place of
// 18 different spider Find()'s.
REFind( "bot\b", strTempUserAgent ) OR
// This will try to get any RSS feed readers.
REFind( "\brss", strTempUserAgent ) OR
Find( "slurp", strTempUserAgent ) OR
Find( "mediapartners-google", strTempUserAgent ) OR
Find( "zyborg", strTempUserAgent ) OR
Find( "emonitor", strTempUserAgent ) OR
Find( "jeeves", strTempUserAgent ) OR
Find( "sbider", strTempUserAgent ) OR
Find( "findlinks", strTempUserAgent ) OR
Find( "yahooseeker", strTempUserAgent ) OR
Find( "mmcrawler", strTempUserAgent ) OR
Find( "jbrowser", strTempUserAgent ) OR
Find( "java", strTempUserAgent ) OR
Find( "pmafind", strTempUserAgent ) OR
Find( "blogbeat", strTempUserAgent ) OR
Find( "converacrawler", strTempUserAgent ) OR
Find( "ocelli", strTempUserAgent ) OR
Find( "labhoo", strTempUserAgent ) OR
Find( "validator", strTempUserAgent ) OR
Find( "sproose", strTempUserAgent ) OR
Find( "ia_archiver", strTempUserAgent ) OR
Find( "larbin", strTempUserAgent ) OR
Find( "psycheclone", strTempUserAgent ) OR
Find( "arachmo", strTempUserAgent ) OR
// These IP addresses are being black listed as being spiders
// even though they are not advertising themselves as being
// spiders via the CGI.http_user_agent.
(NOT Compare( CGI.remote_addr, "208.66.195.5" ))
// I am no longer checking for the following as they are being
// checked in the regular expression at the top.
// Find( "turnitinbot", strTempUserAgent ) OR
// Find( "ziggsbot", strTempUserAgent ) OR
// Find( "rufusbot", strTempUserAgent ) OR
// Find( "researchbot", strTempUserAgent ) OR
// Find( "ip2mapbot", strTempUserAgent ) OR
// Find( "gigabot", strTempUserAgent ) OR
// Find( "exabot", strTempUserAgent ) OR
// Find( "mj12bot", strTempUserAgent ) OR
// Find( "outfoxbot", strTempUserAgent ) OR
// Find( "obot", strTempUserAgent ) OR
// Find( "snapbot", strTempUserAgent ) OR
// Find( "myfamilybot", strTempUserAgent ) OR
// Find( "girafabot", strTempUserAgent ) OR
// Find( "aipbot", strTempUserAgent ) OR
// Find( "googlebot", strTempUserAgent ) OR
// Find( "becomebot", strTempUserAgent ) OR
// Find( "msnbot", strTempUserAgent ) OR
// Find( "irlbot", strTempUserAgent ) OR
// Find( "baiduspider", strTempUserAgent )
){
// This application definition is for robots that do NOT need sessions.
THIS.Name = "KinkySolutions v.1 {dev}";
THIS.SessionManagement = false;
THIS.SetClientCookies = false;
THIS.ClientManagement = false;
THIS.SetDomainCookies = false;
// Set the flag for session use.
REQUEST.HasSessionScope = false;
// Only set the cookie values if we have not already done this. Most
// of the users who get this far are spiders and cannot accept
// cookies... which is why I am turning off session management.
// However, if they are spiders, I am not too concerned about
// performance so I set anyway for good practice.
if (NOT COOKIE.SessionScopeTested){
// Set the client cookie for testing so that this doesn't get
// tested again.
COOKIE.SessionScopeTested = 1;
// Set the client cookie for session availability. This user has
// been determined to not need sessions.
COOKIE.HasSessionScope = 0;
// Flag the cookie test as being performed.
blnCookieTestPerformed = true;
}
} else {
// This application is for the standard user.
THIS.Name = "KinkySolutions v.1 {dev}";
THIS.SessionManagement = true;
THIS.SetClientCookies = true;
THIS.SessionTimeout = CreateTimeSpan(0, 0, 20, 0);
THIS.LoginStorage = "SESSION";
// Set the flag for session use.
REQUEST.HasSessionScope = true;
// Only set the cookie values if we have not already done this.
if (NOT COOKIE.SessionScopeTested){
// Set the client cookie for testing so that this doesn't get
// tested again.
COOKIE.SessionScopeTested = 1;
// Set the client cookie for session availability. This user
// has been determined to allow session management.
COOKIE.HasSessionScope = 1;
// Flag the cookie test as being performed.
blnCookieTestPerformed = true;
}
}
Things to notice in the above code:
- As the SECOND test of the IF() statement, I am checking the user's cookie values for "COOKIE.SessionScopeTested AND (NOT COOKIE.HasSessionScope)". For a visitor who has already been flagged as not needing a session, this evaluates to (1 AND (NOT 0)), which is TRUE; since ColdFusion uses short-circuit evaluation for conditional statements, the rest of the OR list is skipped and the request goes straight into the no-session branch. If the user is a regular user that accepts cookies, then these values evaluate to (1 AND (NOT 1)), which is FALSE. Be careful here: a FALSE operand does not short-circuit an OR list, so on its own this test does not stop the remaining checks from running for that user. To send a returning user directly to the ELSE statement, the cookie values have to be tested ahead of the list, as a guard joined to it with AND.
With that guard in place, a regular user no longer has the worst-case scenario for session management testing. In fact, the user now gets the BEST-case scenario, as they are the only ones who can take advantage of it. Spiders cannot accept cookies and therefore, not only will the cookie logic be uselessly done on every page request, but they still have to execute at least part of the IF() statement. That is a trade-off I am happy with: web spiders are more patient than users.
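To make the short-circuit behavior concrete, here is a minimal sketch (not the code running on this site) in which the cookie test guards the spider checks with AND. It assumes the COOKIE values have already been param'd to numeric defaults as shown earlier, and it repeats only three of the user agent checks for brevity; the variable blnAlreadyCleared is a hypothetical name used just for this illustration.

```cfml
<cfscript>
	// Lowercase copy of the user agent, as in the listing above.
	strTempUserAgent = LCase( CGI.http_user_agent );

	// Hypothetical flag: true only when a previous request ran the test
	// and decided that this visitor gets session management.
	blnAlreadyCleared = (
		COOKIE.SessionScopeTested AND
		COOKIE.HasSessionScope
		);

	// AND stops evaluating as soon as its left side is FALSE, so a visitor
	// who is already cleared never reaches the spider checks in the
	// parentheses. Everyone else still runs the OR list, which stops at
	// the first TRUE operand.
	if (
		(NOT blnAlreadyCleared) AND
		(
			(NOT Len( strTempUserAgent )) OR
			REFind( "bot\b", strTempUserAgent ) OR
			REFind( "\brss", strTempUserAgent ) OR
			Find( "slurp", strTempUserAgent )
			// ... the remaining Find() checks would follow here.
		)
		){
		REQUEST.HasSessionScope = false;
	} else {
		REQUEST.HasSessionScope = true;
	}
</cfscript>
```

One difference from the listing above: a visitor whose cookies say "tested, no session" runs the spider checks again in this sketch rather than being short-circuited by the cookie test. Since those visitors are almost always spiders, that extra work lands on the requests where load time matters least.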
Now, until now, we have not actually set any permanent COOKIE values. We have only temporarily set values and referred to them. That is why, AFTER the IF() statement, we have this bit of code:
<!---
Check to see if the session management cookie values were updated (the
test was run). If so, we need to write the cookie value to the user's
browser. We could have done this above, but I didn't want to break out
of the CFScript block as it would not have looked nice.
--->
<cfif blnCookieTestPerformed>
<!--- Write the cookie value for the test. --->
<cfcookie
name="SessionScopeTested"
value="#COOKIE.SessionScopeTested#"
expires="NEVER"
/>
<!--- Write the cookie value for the test outcome. --->
<cfcookie
name="HasSessionScope"
value="#COOKIE.HasSessionScope#"
expires="NEVER"
/>
</cfif>
We only want to write the cookie values once since they never expire, so we only write them if the test for session management has been performed and has defined a cookie value. If you look at the code above, you will see that we can only ever set the cookie values if they have not been set yet.
So what has this accomplished? For the web crawlers, there is no change. They have a few more lines of code to run in their page pre-processing, but overall, nothing has changed except for an attempt to set cookie values, which will fail. For the end user though, this should show a performance increase as we are cutting down on the session management logic that has to run on their requests.
The page does set more values, COOKIE and temporary, but setting values in ColdFusion takes virtually no time.
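If the black list grows beyond the single IP address in the IF statement above, one option is to keep the addresses in a list so that the condition stays one line long. This is only a sketch under assumptions: the variable names lstSpiderIPs and blnIsBlackListedIP are hypothetical, and the second address is a documentation-only placeholder rather than a real spider.

```cfml
<cfscript>
	// Hypothetical comma-delimited black list. 208.66.195.5 is the address
	// from the listing above; 192.0.2.10 is a placeholder for illustration.
	lstSpiderIPs = "208.66.195.5,192.0.2.10";

	// ListFind() returns the 1-based position of an exact, case-sensitive
	// match, or 0 when the value is not in the list.
	blnIsBlackListedIP = ( ListFind( lstSpiderIPs, CGI.remote_addr ) GT 0 );
</cfscript>
```

The resulting flag could then stand in for the Compare() call as the last operand of the OR list.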
Reader Comments
Ben,
Came across the post via CFGURU; was wondering a couple of years later if this is still your preferred method for combating spiders?
Brian
PS - Whenever I search for something CF related, you seem to have a post come back in the first 3-5 results on Google. Either we are solving the exact same problems or you are an SEO master! :)
@Brian,
This is pretty much the method I use right now for stopping extra sessions from being created. It's been working well, I think. I haven't actually measured it, since I don't really take the time to poke around in SeeFusion or anything to see how much memory my site is using. But, it seems to work well.
Yeah, I get some good Google luck :)
Could you check for cgi.http_referer?
Generally spiders don't have this value (they won't tell you where they've come from).
So you could run any onSessionStart() request that arrives without a cgi.http_referer through your bot checker?
Just a thought...
@Geoff,
Definitely a very good thought. My only concern with that is that some people set their routers to strip out that kind of data on all outgoing requests. Not sure why, seems a tad paranoid to me :) But, I HAVE had that cause problems in the past - not in session management, but with other issues, so I am hesitant to use it.
It's definitely not required for the referrer to be provided so you can't count on this. It will get dumped in a lot of cases. You could use it as a component of a check but I think the best solution (using a cookies check) to date was posted on a mailing list yesterday. I summarized it on my blog at http://www.ghidinelli.com/2008/03/26/minimizing-memory-damage-from-bot-created-sessions-in-coldfusion/
Before yesterday I hadn't even considered the impact of these sessions and thanks to Ben and the others on the list, my server is now in better shape.
OK... what about Adobe AIR requests? These are not browser requests at all. What happens on these requests?
@John,
AIR requests are interesting. I have not played with AIR, so I am not sure what kind of UserAgent they present as. However, AIR requests don't need sessions since they are remote applications. As such, we can probably turn session off or keep them really short.
But, I cannot back that up with any experience :)
AIR requests are sessionless requests, so they should be treated as crawler requests. But if you need to connect to CFCs for data or whatever, it's normal to have a specific path for that with an extended Application.cfc, so you could remove the crawler stuff in there, as that content is usually not intended to be crawlable.
@Michael,
Oh, nice idea! Very cool.
Do you still find it worth the extra effort to support cookieless browsers? If not, you can pick off nearly all of the spiders by doing something like:
<!--- Check if client cookies are enabled. --->
<cfif not isBoolean(URLSessionFormat("true"))>
<!--- Cookies are off: set a short session or no session. --->
<cfelseif not structKeyExists(cookie, "COOKIESON")>
<!--- Run your bot code and set a short or no session for bots, or a COOKIESON cookie for legit users. --->
</cfif>
@Adam,
That's an interesting approach. I wonder how URLSessionFormat() knows if the client accepts cookies. Very clever!
Hello Ben, firstly - great blog! :-)
I've just had the issue with bots filling up our cfsession datastore over the past 4 months - 112 MILLION rows of CDATA! :-( This had a major adverse effect on our site performance.
I've been scouring the web to try and find the best way to solve this issue when I came across this blog post. I've implemented your solution above in our CFC, however, am noticing some weird side-effects.
1. Application OnStart seems to fire every few seconds. Would constantly changing the application configuration (ie session and client management settings) trigger this? We maintain an application.sessionCount variable and this is always being reset to 0 every few seconds. The only time this is actually set to 0 is in ApplicationOnStart.
2. The number of rows in the CDATA & CGLOBAL tables is still growing at about 1,100 rows per hour - so this indicates to me that bots are still maxing out our cfsession table.
Looking forward to your response!
Cheers
Steven :-)
@Steven,
Very interesting; the application should only re-start if you are changing the application timeout or perhaps the name of the application. As long as you only mess with session timeouts, I can not think of why this would be an issue.
Is your application name / timeout dynamic at all?
Hi Ben, no, the timeout is not dynamic at all. The two code chunks are:
// This application definition is for robots that do NOT need sessions.
this.name = "eClipse_KABS_Boomerang";
this.clientmanagement= false;
this.setDomainCookies = false;
this.sessionmanagement = false;
this.scriptProtect = "all";
.....and.....
// This application is for the standard user.
this.name = "eClipse_KABS_Boomerang";
this.clientmanagement= "yes";
this.setDomainCookies = "yes";
this.sessionmanagement = "yes";
this.SessionTimeout = CreateTimeSpan(0,1,0,0);
this.scriptProtect = "all";
It looks like I may have given you a bum steer, however, as further tests show that the ApplicationOnStart event is not firing multiple times (which is good) - the application.sessionCount counter, however, is somehow not working correctly.
This is the least of my worries, however, as the CDATA & CGLOBAL tables are filling at roughly 26,500 rows per day! I suspect that it's because bots that have no cookie scope are still hitting the site, and each hit is generating a new session in CF - would that be correct?
Any help would be appreciated (or point me in the right direction) on how to avoid this.
Sincerely
Steven :-)
@Steven,
I am not sure what CDATA & CGLOBAL tables are? Is that for the client-management aspect? I was told a long time ago to avoid client variables and have pretty much never looked into them again, so if those are standard tables for storage, I apologize for my ignorance. I don't fully understand how client variables work, especially when they are persisted or how they are even associated with a given request.