Ask Ben: Grabbing Google Results with CFHttp
I am getting errors when I try to grab Google results with CFHttp. But when I go to the page with my browser, it works just fine. What am I doing wrong?
You are not doing anything wrong. Google wants to be used by regular web users, and CFHttp does not announce itself as one. When you do a CFHttp page grab, it passes along a non-browser value as its User Agent. I am not sure offhand what it is, but I think it sends "ColdFusion" as its user agent. A regular CFHttp request will return this error:
Your client does not have permission to get URL /search?hl=en&lr=&q=upsidedown+dogs&btnG=Search from this server.
This block is there for a reason: you might be violating the Google terms of service (I have not read them, nor do I condone working around this). If you want to get past the error, you can fool Google into thinking you ARE a web browser by sending a standard browser user agent in your CFHttp:
<!--- Grab the Google search results. --->
<cfhttp
	url="http://www.google.com/search?hl=en&lr=&q=upsidedown+dogs&btnG=Search"
	useragent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; FDM)"
	result="objGoogleGrab"
	method="GET"
	resolveurl="true"
	/>

<!--- Output the search results. --->
<cfoutput>
	#objGoogleGrab.FileContent#
</cfoutput>
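Notice that I am sending a browser user agent (the string above is the one Internet Explorer 6 sends; like most browser user agents, it starts with "Mozilla"). This should work just fine. But again, I am not aware of the legality of such an action - proceed with caution.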
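If you want to be sure the grab worked before you output anything, you can look at the status code that CFHttp hands back. This is only a minimal sketch: it assumes the objGoogleGrab result from the CFHttp call above, and it assumes the StatusCode value starts with the numeric HTTP code (for example "200 OK").

<!--- Pull the numeric code off the front of the status (ex. "200 OK"). --->
<cfset intStatus = Val( ListFirst( objGoogleGrab.StatusCode, " " ) ) />

<cfif (intStatus EQ 200)>
	<!--- The request went through; output the search results. --->
	<cfoutput>
		#objGoogleGrab.FileContent#
	</cfoutput>
<cfelse>
	<!--- Google (or the network) refused the request. --->
	<cfoutput>
		Request failed: #HtmlEditFormat( objGoogleGrab.StatusCode )#
	</cfoutput>
</cfif>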
Want to use code from this post? Check out the license.
Reader Comments
Great post on pulling Google search results pages. Now if I could only figure out how to get it to only pull the result for a particular site's listing. Trying to analyze the different result text for one domain for different keyword searches.
@Dr. Adam,
Just put "site:yourdomain.com" in the Google query and it should only pull results for that particular site.
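Here is a rough sketch of what that might look like. The domain "example.com" is just a placeholder, and strQuery is a variable name I made up for illustration. Rather than building the query string by hand, this passes the search terms with a CFHttpParam of type "url", which appends the name/value pair to the URL and URL-encodes it for you:

<!--- Search terms plus the site: restriction ("example.com" is a placeholder). --->
<cfset strQuery = "upsidedown dogs site:example.com" />

<cfhttp
	url="http://www.google.com/search"
	useragent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; FDM)"
	result="objGoogleGrab"
	method="GET"
	resolveurl="true">

	<!--- Appended to the URL as q=... and URL-encoded by ColdFusion. --->
	<cfhttpparam type="url" name="q" value="#strQuery#" />

</cfhttp>

<cfoutput>
	#objGoogleGrab.FileContent#
</cfoutput>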
Thanks for the post. I didn't even think about using CFHttp to grab Google results. Is there any way to grab just the results and not the rest of the Google page that appears?
@Keith,
You could probably use markers in the page to grab only the content between the start and the end of the results. However, you might be better off seeing if Google has a search API that fits your needs more easily.
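As a rough sketch of the marker idea: find the start marker, find the end marker after it, and take whatever sits in between. The two marker strings below are completely made up; you would have to view the source of the returned page and pick real text that reliably surrounds the results (and that text can change whenever Google changes its markup). This assumes objGoogleGrab holds the result of a CFHttp call like the one in the post.

<!--- HYPOTHETICAL markers; replace with real text from the page source. --->
<cfset strStartMarker = "RESULTS_START" />
<cfset strEndMarker = "RESULTS_END" />

<cfset strPage = objGoogleGrab.FileContent />
<cfset strResults = "" />

<!--- Find the start marker (Find() returns 0 if it is not there). --->
<cfset intStart = Find( strStartMarker, strPage ) />

<cfif intStart>

	<!--- Move past the start marker, then look for the end marker. --->
	<cfset intStart = (intStart + Len( strStartMarker )) />
	<cfset intEnd = Find( strEndMarker, strPage, intStart ) />

	<!--- Only grab the content if there is something between the markers. --->
	<cfif (intEnd GT intStart)>
		<cfset strResults = Mid( strPage, intStart, (intEnd - intStart) ) />
	</cfif>

</cfif>

<cfoutput>
	#strResults#
</cfoutput>

If either marker is missing, strResults simply stays empty rather than throwing an error.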
After posting my comment I found their custom search service and that works like a charm.
@Keith,
Ok great.
OK, I'm trying to use this with a Google Site Search custom search engine. I thought this fixed my first problem, but now I get a fully formatted HTML page instead of straight XML. If I go to the link directly in the browser, I get XML.
I looked at the WoW example, but when I run those code snippets, I STILL get HTML for both of those.
I'm very confused. :(
@Thane,
Are you sure you're passing through the same user agent that your browser has? Try hitting a CFM page with your browser and outputting CGI.http_user_agent to see what it is sending. Then, pass that same value as the useragent in the CFHttp call to your search page.
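Something like this might help with the test. It is only a sketch: the search URL is a placeholder (I don't know your actual custom search address) and objSearch is a name I picked for the result. When you hit this page with your browser, it shows the user agent your browser sent and then replays that exact value in the CFHttp call:

<!--- Show the user agent of whatever browser requested this page. --->
<cfoutput>
	Your user agent: #HtmlEditFormat( CGI.http_user_agent )#
</cfoutput>

<!--- Replay the same user agent (the URL below is a placeholder). --->
<cfhttp
	url="http://www.example.com/your-search-url"
	useragent="#CGI.http_user_agent#"
	result="objSearch"
	method="GET"
	/>

<!--- Escape the response so you can see if it is XML or HTML. --->
<cfoutput>
	<pre>#HtmlEditFormat( objSearch.FileContent )#</pre>
</cfoutput>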
Is there a way to make the search terms user-supplied? And to make it so only the first 10 results show?
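One possible approach, sketched under a few assumptions: the search terms arrive as a "q" URL parameter on your own page (for example, from a form with method="get"), and Google honors a "num" query parameter for the number of results per page. I have not verified that "num" behaves this way, so treat that part as a guess and test it yourself.

<!--- Default to an empty search if no terms were passed in. --->
<cfparam name="URL.q" type="string" default="" />

<cfif Len( Trim( URL.q ) )>

	<cfhttp
		url="http://www.google.com/search"
		useragent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; FDM)"
		result="objGoogleGrab"
		method="GET"
		resolveurl="true">

		<!--- The user's terms; ColdFusion URL-encodes the value. --->
		<cfhttpparam type="url" name="q" value="#Trim( URL.q )#" />

		<!--- ASSUMPTION: "num" limits the number of results returned. --->
		<cfhttpparam type="url" name="num" value="10" />

	</cfhttp>

	<cfoutput>
		#objGoogleGrab.FileContent#
	</cfoutput>

</cfif>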