Ben Nadel at InVision In Real Life (IRL) 2018 (Hollywood, CA) with: Lindsey Redinger

Ask Ben: Getting The Domain Name From The Referer URL

By
Published in , Comments (18)

Hi Ben Nadel, I want to use regex code to extract only domain name for http referrers, can you please give me clue? thanks.

Normally, when we think about the domain name of a URL (which is what the CGI.http_referer value is), we think of it as the part of the URL that comes right after the protocol (http://, https://, ftp://, etc.). As such, we could grab the domain name by using a positive look behind on the protocol. However, the regular expression engine that ColdFusion uses (POSIX) does not allow for look behinds. As such, we can either reach down into the Java layer and use the Java Pattern library; or, we can hack the POSIX engine to do what we need.
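For comparison, here is roughly what the look behind approach might look like in plain Java using java.util.regex. This is only a minimal sketch: the class and method names (LookBehindDomain, domainOf) are made up for illustration, and the look behind checks for just the fixed "://" rather than the whole protocol because, as far as I know, the Java engine wants look behinds with a bounded length.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookBehindDomain {

    // Match the run of characters that sits right after "://" and stops
    // at the first "/" (path) or ":" (port).
    private static final Pattern DOMAIN = Pattern.compile("(?<=://)[^/:]+");

    // Hypothetical helper: returns the domain, or "" when nothing matches.
    static String domainOf(String referer) {
        Matcher matcher = DOMAIN.matcher(referer);
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(domainOf("http://www.shemuscle.com/category/anonymous/"));
    }
}
```

Because find() stops at the first match, a second "://" later in the URL (in a query string, for example) should not get in the way.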

Since reaching down into the Java layer is probably overkill for such a use case, I will instead explore two different methods to use the POSIX regex engine to get what we need. One method will use the REReplace() function and one method will use the REMatch() function (only available in ColdFusion 8 or later).

Using REReplace() To Get The Domain Name

While it might not seem intuitive to use a replace function to extract part of a string, if we replace the entire string with the substring that we are seeking, what do we end up with? Our target substring. Because the REReplace() function allows us to capture groups in our regular expression and then use them within our "replacement text," we have the ability to replace our original string with just our target string:

<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!---
	Because standard ColdFusion regex (POSIX) does not give us
	nice look behind functionality, we don't have a nice way to
	gather the match using reMatch(). As such, the easiest way we
	can get the referring domain is to replace the entire string
	with the captured domain.

	Notice that in the following reReplace() statement, we are
	matching the entire string; but, we are only capturing the
	domain in group one (\1). Then, we replace the entire string
	(referer value) with the contents of the first group (domain).
	This leaves us with just the domain name of the referer.
--->
<cfset refererDomain = reReplace(
	referer,
	"^\w+://([^\/:]+)[\w\W]*$",
	"\1",
	"one"
	) />

<!--- Output the referer. --->
<cfoutput>Referer Domain: #refererDomain#</cfoutput>

Notice that the regular expression pattern matches the entire URL but captures only the domain value. This lets us reference the captured domain in our replacement text, so the whole URL gets replaced with the value of the captured group: our domain. Keep in mind that if the pattern does not match at all (an empty referer, for example), nothing gets replaced and we get the original string back. When we run this code, we get the following output:

Referer Domain: www.shemuscle.com
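If it helps to see the same "replace everything with the captured group" trick outside of ColdFusion, here is a small Java sketch. It uses String.replaceFirst() with the same pattern (Java writes the back-reference as $1 instead of \1); the class and method names (ReplaceDomain, domainOf) are hypothetical.

```java
class ReplaceDomain {

    // Match the whole URL, capture only the domain in group one, and
    // replace the entire match with that group.
    static String domainOf(String referer) {
        return referer.replaceFirst("^\\w+://([^/:]+)[\\w\\W]*$", "$1");
    }

    public static void main(String[] args) {
        // Prints: www.shemuscle.com
        System.out.println(domainOf("http://www.shemuscle.com/category/anonymous/"));
        // No protocol means no match, so the input comes back unchanged.
        System.out.println(domainOf("not-a-url"));
    }
}
```

The second call shows the one catch with this approach: a value that does not match the pattern is returned as-is, so it is worth checking the result before treating it as a domain.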

Using REMatch() To Get The Domain Name

If you are using ColdFusion 8 or later, you can use the REMatch() function to gather all matches in a given string. We can use this function to match parts of the target URL and then pluck the domain name out of the returned matches. Because regular expressions are evaluated from left to right in a greedy fashion, we can have our regular expression pattern match parts of the URL moving from left to right; first, we'll match the protocol, then the domain name, then the rest of the string:

<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Extract the various parts of the URL. --->
<cfset urlParts = reMatch(
	"^\w+://|[^\/:]+|[\w\W]*$",
	referer
	) />

<!--- Output the parts we captured: --->
<cfloop
	index="urlPart"
	array="#urlParts#">

	<cfoutput>Part: #urlPart#<br /></cfoutput>

</cfloop>

Notice in this code that our regular expression matches the three crucial parts of the URL: the protocol, the domain, and everything after the domain. And, when we run this code, we get the following output:

Part: http://
Part: www.shemuscle.com
Part: /category/anonymous/

In this case, the domain name is the second item matched and can be extracted from the matches using urlParts[ 2 ].
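To make the left-to-right matching a bit more concrete, here is a rough Java sketch of the same idea using a find() loop over the same three-way pattern. The names (UrlParts, partsOf) are made up for this example. One difference I had to account for: in Java, the last alternative can also produce an empty match at the very end of the string, so the sketch skips empty matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlParts {

    // Protocol, OR a run of non-slash / non-colon characters, OR the rest.
    private static final Pattern PARTS =
        Pattern.compile("^\\w+://|[^/:]+|[\\w\\W]*$");

    // Hypothetical helper: collects the non-empty matches in order.
    static List<String> partsOf(String referer) {
        List<String> parts = new ArrayList<>();
        Matcher matcher = PARTS.matcher(referer);
        while (matcher.find()) {
            // Skip the empty match that the last alternative can produce
            // at the end of the input.
            if (!matcher.group().isEmpty()) {
                parts.add(matcher.group());
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> parts = partsOf("http://www.shemuscle.com/category/anonymous/");
        // Java lists are zero-based, so the domain is at index 1 here.
        System.out.println(parts.get(1));
    }
}
```

Note that the "second item" rule only holds when the value actually starts with a protocol. Given something like www.shemuscle.com/category/, the first alternative never matches and the domain ends up as the first item instead.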

Using java.net.URL To Get The Domain Name

As a final method, let's quickly explore the Java URL object. In the previous examples, we had to do all of the heavy lifting ourselves in terms of figuring out how to parse the URL using regular expressions. Well, if we use the Java class, java.net.URL, we can offload that heavy lifting. If we create an instance of the Java URL class and initialize it with our target URL, it will parse the URL internally and give us access to the URL components:

<!---
	Define a referer. Normally this would come out of the CGI
	scope, but for now, we are going to simulate it.
--->
<cfset referer = "http://www.shemuscle.com/category/anonymous/" />

<!--- Create a Java URL object based on our referer URL. --->
<cfset javaUrl = createObject( "java", "java.net.URL" ).init(
	javaCast( "string", referer )
	) />

<!---
	The Java url has parsed the url for us and we can now extract
	the components from our Java url instance.
--->
<cfoutput>Referer Domain: #javaUrl.getHost()#</cfoutput>

Notice that all we have to do is create the URL instance and pass in our referer URL. The Java URL takes care of the rest. Then, all we have to do is ask it for the domain name (host) of the given URL. And, when we run the above code, we get the following output:

Referer Domain: www.shemuscle.com

Works like a charm and we didn't have to get our hands dirty with any regular expressions.
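One thing to keep in mind is that a real referer is often empty (direct visits) or just plain garbage, and the URL class throws an error when it cannot parse its input. Here is a hedged Java sketch of a wrapper that falls back to an empty string in those cases. The names (RefererHost, hostOf) are hypothetical, and I am building the URL by way of java.net.URI rather than the URL constructor used above, since that constructor is deprecated in recent JDKs.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class RefererHost {

    // Hypothetical helper: returns the host of the referer, or "" when
    // the referer is missing or cannot be parsed as an absolute URL.
    static String hostOf(String referer) {
        if (referer == null || referer.isEmpty()) {
            return "";
        }
        try {
            URL url = URI.create(referer).toURL();
            return url.getHost();
        } catch (IllegalArgumentException | MalformedURLException e) {
            // Not a valid, absolute URL (or an unknown protocol).
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("http://www.shemuscle.com/category/anonymous/"));
    }
}
```

In ColdFusion, the same idea would amount to checking len( referer ) and wrapping the createObject() call in a try / catch block.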

Regular expressions are a great tool in the programming toolbox; and, they are amazing for string parsing. But sometimes, we can offload the processing of strings to existing pieces of functionality like the Java URL class, and get what we need without any of the complexity associated with regular expressions. I hope this helps!


Reader Comments

14 Comments

Some nice methods there Ben; I certainly didn't know about the JAVA method. For a similar thing, I've gone down the following route:

<cfset referer = "http://www.shemuscle.com/category/anonymous/" />
<Cfset a=ListToArray(referer,"/")>
<cfdump var="#a[2]#">

This is obviously assuming there is a "http://" at the front of the referer or URL but a simple check can be put in place to detect that and amend the output accordingly. I always tend to shy away from reg ex's due to never quite "getting" them.

Tom

15,848 Comments

@Tom,

Tom, nice one! List usage is definitely an easy and straightforward way to go! I suppose we could also use ListToArray() and access it that way as well!

Awesome tip!

11 Comments

It's good to see Ben always giving multiple solutions :) I didn't know about the Java method either.

I generally just do: ListGetAt(referer,2,'/')

Of course, a check is required to make sure the list has at least 2 items.

15,848 Comments

@Todd,

That looks like an intense UDF. That Dan Switzer is a really brilliant programmer.

@Sumit,

Thanks my man. Yeah, I totally forgot about using lists :)

19 Comments

Speaking of regex, links and all that is there any way possible to get the value of a href?

I know how to get the actual anchor text but not the href value.

19 Comments

@Ben

The anchor tags are pulled from everywhere. There is no one set place that I pull them from.

We can use wikipedia as an example as this is what I'm currently working on.

15,848 Comments

@Jody,

The easiest thing would probably be to extract all of the anchor tags, then from each of those, extract the HREF value. I think trying to go directly to the HREF value might be overly complicated.

48 Comments

Jody: Does it need to be done on the server side? It would be really easy with jQuery on the client side.

$('a').each(function(){alert($(this).attr('href'))});

19 Comments

Actually it does need to be done server side. I wish I could do it client side, as that would save a lot of resources, but at any rate I have figured it out. I just converted the whole page into an array, using this as the delimiter:

href=

I then split each array item on a " so I can call on each URL pretty easily. If someone wants the code you can just email me

creditprovided[at-sym]ymail.com

It's really simple when you actually think about it.

But thanks for helping me out I really appreciate it.

15,848 Comments

@Jody,

When you do that, you just have to be careful if the page has any instances of "href=" that are not part of actual HREF tags. For example, pages that have sample code on it would have href tags that are not true HREF tags. But, that said, sounds like it's working for you, so I'm not gonna rock the boat.

2 Comments

For anyone trying to programmatically do stuff with the anchor on the response page besides the default behavior of the browser scrolling to the named anchor, the browser doesn't send it to the server so you won't get it in the cgi scope.

Your only recourse is to process it with javascript on the response page you build. The anchor value is

location.href.split("#")[1]

If your response page has jQuery, the anchor is "myAnchor" and the id of the div you want to highlight is "foo"...

<script type="text/javascript">
$(document).ready(function(){
	var anch=location.href.split("#")[1];
	if(anch=="myAnchor")
		$("#foo").css("background-color","yellow");
});
</script>
<a name="myAnchor"></a>
<div id="foo">hello world</div>
Ben Nadel