Ben Nadel at CFUNITED 2008 (Washington, D.C.) with: Yaron Kohn

Short-Circuit Evaluation Is Fast

As I wrote some time ago, taking Michael Dinowitz's advice, I turn off session management for spiders and bots in an effort to cut down on memory usage on the server. See, spiders do not accept client cookies and therefore (on my sites) cannot hold sessions. Consequently, they start a new session for each page request they make. Since sessions take some time to time out, this ends up creating large numbers of session variables that go unused (in proportion to the number of pages spidered).

When I first did this I used a Regular Expression (RegEx) to check for commonly known spider user agents (CGI.http_user_agent). It looked something like:

if (REFindNoCase( "slurp|googlebot|....", CGI.http_user_agent )){

This works great; however, I started adding more spiders to the list (as they started hitting my site) and I started to fear that it wasn't efficient. If you ever look at how a regular expression works by using a program such as The RegEx Coach, you can actually step through the RegEx path and you will see that for every character it comes across in the target string, it does a LOT of logic for the regular expression. And, the larger the expression, the more the logic.

This got me thinking about short-circuit evaluation. I am not sure which version brought this on board, but ColdFusion MX 7 has this optimization. It means that evaluation of a boolean expression in an IF statement is terminated just as soon as it is possible to tell what the result will be. In other words, if you have several parts of a single IF statement and the first can determine the fate of the IF, then the remaining parts are not evaluated.

For example, in the following statement, only the first value is checked:

if (false AND true AND true AND true){ ... }

Since the "false" makes the statement false no matter what the rest of the arguments are, the remaining "true" values are not even evaluated.
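
One way to see this for yourself is to give each operand a visible side effect. The following is a minimal sketch, meant to run inside a CFScript block; the traced() helper and the REQUEST.calls array are hypothetical names introduced only for this illustration:

```cfscript
// Hypothetical helper: records that it was called, then hands back
// the value it was given.
function traced( label, value ){
	ArrayAppend( REQUEST.calls, ARGUMENTS.label );
	return ARGUMENTS.value;
}

// Keeps track of which operands were actually evaluated.
REQUEST.calls = ArrayNew( 1 );

if (traced( "first", false ) AND traced( "second", true )){
	WriteOutput( "Condition was true. " );
}

// With short-circuit evaluation, only "first" ends up in the list
// because the second operand is never called.
WriteOutput( "Evaluated: " & ArrayToList( REQUEST.calls ) );
```

The same trick works with OR: if the first operand is true, the operands after it should never show up in the list.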

I have taken this idea and applied it to the problem of turning off session management for spiders. Instead of using a regular expression, I break out each comparison to its own sub-part of an IF statement:

// Define the application. To stop unnecessary memory usage, we are going
// to give web crawlers no session management. This way, they don't have
// to worry about cookie acceptance and object persistence (except for
// APPLICATION scope). Here, we are using short-circuit evaluation on the
// IF statement with the most popular search engines at the top of the
// list. This will help us minimize the amount of time that it takes to
// evaluate the list.
if (
	(NOT Len(CGI.http_user_agent)) OR
	FindNoCase( "Slurp", CGI.http_user_agent ) OR
	FindNoCase( "Googlebot", CGI.http_user_agent ) OR
	FindNoCase( "BecomeBot", CGI.http_user_agent ) OR
	FindNoCase( "msnbot", CGI.http_user_agent ) OR
	FindNoCase( "Mediapartners-Google", CGI.http_user_agent ) OR
	FindNoCase( "ZyBorg", CGI.http_user_agent ) OR
	FindNoCase( "RufusBot", CGI.http_user_agent ) OR
	FindNoCase( "EMonitor", CGI.http_user_agent ) OR
	FindNoCase( "researchbot", CGI.http_user_agent ) OR
	FindNoCase( "IP2MapBot", CGI.http_user_agent ) OR
	FindNoCase( "GigaBot", CGI.http_user_agent ) OR
	FindNoCase( "Jeeves", CGI.http_user_agent ) OR
	FindNoCase( "Exabot", CGI.http_user_agent ) OR
	FindNoCase( "SBIder", CGI.http_user_agent ) OR
	FindNoCase( "findlinks", CGI.http_user_agent ) OR
	FindNoCase( "YahooSeeker", CGI.http_user_agent ) OR
	FindNoCase( "MMCrawler", CGI.http_user_agent ) OR
	FindNoCase( "MJ12bot", CGI.http_user_agent ) OR
	FindNoCase( "OutfoxBot", CGI.http_user_agent ) OR
	FindNoCase( "jBrowser", CGI.http_user_agent ) OR
	FindNoCase( "ZiggsBot", CGI.http_user_agent ) OR
	FindNoCase( "Java", CGI.http_user_agent ) OR
	FindNoCase( "PMAFind", CGI.http_user_agent ) OR
	FindNoCase( "Blogbeat", CGI.http_user_agent ) OR
	FindNoCase( "TurnitinBot", CGI.http_user_agent ) OR
	FindNoCase( "ConveraCrawler", CGI.http_user_agent ) OR
	FindNoCase( "Ocelli", CGI.http_user_agent ) OR
	FindNoCase( "Labhoo", CGI.http_user_agent ) OR
	FindNoCase( "Validator", CGI.http_user_agent ) OR
	FindNoCase( "sproose", CGI.http_user_agent ) OR
	FindNoCase( "oBot", CGI.http_user_agent ) OR
	FindNoCase( "MyFamilyBot", CGI.http_user_agent ) OR
	FindNoCase( "Girafabot", CGI.http_user_agent ) OR
	FindNoCase( "aipbot", CGI.http_user_agent ) OR
	FindNoCase( "ia_archiver", CGI.http_user_agent ) OR
	FindNoCase( "Snapbot", CGI.http_user_agent ) OR
	FindNoCase( "Larbin", CGI.http_user_agent ) OR
	FindNoCase( "psycheclone", CGI.http_user_agent ) OR
	FindNoCase( "ColdFusion", CGI.http_user_agent )
	){

	// This application definition is for robots that do NOT need sessions.
	THIS.Name = "KinkySolutions v.1 {dev}";
	THIS.SessionManagement = false;
	THIS.SetClientCookies = false;
	THIS.ClientManagement = false;
	THIS.SetDomainCookies = false;

	// Set the flag for session use.
	REQUEST.HasSessionScope = false;

} else {

	// This application is for the standard user.
	THIS.Name = "KinkySolutions v.1 {dev}";
	THIS.SessionManagement = true;
	THIS.SetClientCookies = true;
	THIS.SessionTimeout = CreateTimeSpan(0, 0, 20, 0);
	THIS.LoginStorage = "SESSION";

	// Set the flag for session use.
	REQUEST.HasSessionScope = true;

}

Now, regular expressions do short-circuit evaluation also, so the difference here is subtle. Let's say that we get a page request from a non-spider user agent. This is the "worst case" scenario since we will have to check every spider value against the string. With a regular expression, we would have to run through the matching process for each of the (N) spider values for each of the (C) characters in the user agent. That's NxC iterations. However, in the compound IF statement, we would only have to run the matching process for each spider for each (U) instance of a user agent. That's just NxU and, since U is always one, it's just N iterations.

Now, this is misleading because for string comparison, the substrings still have to be matched against many characters in the target string. But I suspect (and do not know for a fact) that literal matching must be faster than RegEx matching since there is no "logic" to literal matching.

If we do get a request from a popular spider (higher in the IF statement, earlier in the regular expression), it's still faster to have the compound IF statement. See, the regular expression still needs to be checked in its entirety for EACH character it comes across in the target string. But the IF statement only needs a subset of its sub-parts run, and each of those just once.

Of course, in practice, they all run between 0-16ms per page hit. With large iterations (10,000+), the compound IF statement is orders of magnitude faster.
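
If you want to try the comparison on your own server, a rough timing harness might look like the sketch below. It is meant to run inside a CFScript block. The user agent string, the three-spider list, and the variable names are assumptions chosen for illustration, not the values used for the numbers above, and the results will vary by server and ColdFusion version:

```cfscript
// Hypothetical non-spider user agent. This is the worst case since
// every check has to fail.
userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
iterations = 10000;

// Time the single regular expression.
startTick = GetTickCount();
for (i = 1 ; i LTE iterations ; i = i + 1){
	isSpider = REFindNoCase( "slurp|googlebot|msnbot", userAgent );
}
regExTime = (GetTickCount() - startTick);

// Time the compound, short-circuited condition.
startTick = GetTickCount();
for (i = 1 ; i LTE iterations ; i = i + 1){
	isSpider = (
		FindNoCase( "slurp", userAgent ) OR
		FindNoCase( "googlebot", userAgent ) OR
		FindNoCase( "msnbot", userAgent )
		);
}
findTime = (GetTickCount() - startTick);

WriteOutput( "RegEx: #regExTime# ms, FindNoCase: #findTime# ms" );
```

Swapping in a user agent that matches the first spider in the list gives you the best case for both approaches.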

Furthermore, you can make it even faster by creating a temporary string of the LCase() of the user agent and then doing Find() rather than FindNoCase() for each sub-part (not shown above).
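
A minimal sketch of that variation, shortened to a few spiders, could look like this. It is meant to run inside a CFScript block; the userAgent and isSpider variable names are just placeholders for illustration. Note that the search strings have to be written in lowercase since Find() is case sensitive:

```cfscript
// Lowercase the user agent once so that each sub-part can use the
// case-sensitive Find() rather than FindNoCase().
userAgent = LCase( CGI.http_user_agent );

isSpider = (
	(NOT Len( userAgent )) OR
	Find( "slurp", userAgent ) OR
	Find( "googlebot", userAgent ) OR
	Find( "msnbot", userAgent )
	);
```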

Reader Comments

15,848 Comments

@Andy,

A Switch statement would not quite work in a situation like this since we are not matching on the entire user agent, just parts of it.

2 Comments

For the record, I was not able to get this working w/ Model Glue because some Model Glue CFC accesses the Session Scope (ModelGlue.unity.statebuilder.StateBuilder.cfc).

I ended up using this approach as a way to not trigger some logging code, but memory variables are still floating out there.

For some odd reason the most common bot that hits my site is ColdFusion

18 Comments

I have something similar in mine but I use a different tack; I set a short session timeout for any agent without a cookie.jsessionid (using j2ee sessions).

You might consider using that as your first IF check since a bot won't have a cookie.jsessionid (or a cfid/cftoken).

34 Comments

Hey Ben

Assume that you develop this for a client who has no idea how to edit the Application.cfc file. Could you just simply store all the bots in a database and loop over them?

Wouldn't that make it a bit easier to manage? Or perhaps instead of checking for bots, just check to see if the user is using a common browser. I assume this would be an easier step considering there are more bots than there are browser types. Just saying...

15,848 Comments

@Brian,

Do the CFID / CFTOKEN values exist at that point in the first page request? I'd have to double-check on that.

@Jody,

True, there are a number of ways to do this. To be honest, I rarely ever update this logic. I haven't even thought much about it in the last few years. For all I know, there might be more bots hitting my site than I realize :)

Yeah, you could probably just check to see if there is a standard user agent.

18 Comments

Actually I may be wrong here - cookies get set early on in the request lifecycle even if they don't stick. I am pretty sure I need to use my own cookie rather than one of the built-ins. Like:

structKeyExists(cookie, "NEEDCOOKIES")

And set it AFTER that check, meaning your first request will get you a short timeout and your second will get you a regular one. We use a cookie like this anyways to verify people can register/stay signed in so I will need to look at using this with your bot check.

1 Comments

I am a bit confused as to why you would do this in the Application.cfc. So every time a bot visits your site, the application is going to lose sessions for all users? Where in the Application.cfc are you putting this code? OnRequestStart? Please explain how this does not corrupt the application scope.
