Screen-Scraping Movie Showtimes Off Google.com With ColdFusion
Yesterday, I experimented with scraping movie showtimes off of the iPhone version of Fandango.com. Today, I wanted to try and do the same thing with the Google.com movie showtimes service. This makes for an interesting comparison because the two sites call for very different approaches to the same problem. With Fandango.com, we get XHTML that is so compliant that we can actually parse it into XML and use XPath to query the document. Google, on the other hand, is so conscious about bandwidth usage that they make their HTML as dirty and as incomplete as possible so long as it still renders properly. As such, when we deal with Google's markup, we have to fall back to string parsing and pattern matching rather than DOM querying.
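Just to make that contrast concrete, here is a rough, hypothetical sketch of the two querying styles. The cleanXhtml and dirtyHtml variables are simply stand-ins for fetched page content, and the class names are illustrative - neither snippet is pulled from the actual demos:
<!--- With compliant XHTML (the Fandango approach), we can parse the page and query the DOM. --->
<cfset xmlDoc = xmlParse( cleanXhtml ) />
<cfset nameNodes = xmlSearch( xmlDoc, "//div[ @class = 'name' ]" ) />
<!--- With loose markup (the Google approach), we fall back to string pattern matching. --->
<cfset nameDivs = reMatch( "<div class=name>.+?</div>", dirtyHtml ) />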
Because I was solving the same problem, I actually wanted to build the same API. So, for this demo, you'll see that the ColdFusion code is almost exactly the same as the code used in the Fandango.com demo:
<!--- Create an instance of the Google movie component. --->
<cfset google = createObject( "component", "Google" ).init() />
<!---
Get the theater information for the showtimes at the
Regal Union Square Stadium 14 theater TODAY.
ID: 10dd19bd6f57c7c8 - Regal Union Square Stadium 14
ID: 14c321fe7754e274 - AMC Empire 25
NOTE: I had to get the theater ID off the website itself.
--->
<cfset theaterInfo = google.getTheaterInfo( "10dd19bd6f57c7c8" ) />
<!--- Output theater information. --->
<cfoutput>
<p>
<strong>#theaterInfo.title#</strong><br />
</p>
<!---
Loop over the movies to output the movie titles and
the times they are showing.
--->
<cfloop
index="movie"
array="#theaterInfo.movies#">
<p>
<strong>#movie.title#</strong><br />
#arrayToList( movie.showtimes, ", " )#
</p>
</cfloop>
</cfoutput>
Pretty much, the only difference here is that I am instantiating a ColdFusion component called "Google" rather than one called "Fandango." Both of these CFCs expose the same public API - a single method, getTheaterInfo() - and that method returns the same structure in both cases. This is the nicest thing about creating an API: you can change the underlying engine without changing the code that relies on it.
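In fact, swapping data sources could hypothetically be reduced to a single variable; nothing downstream of getTheaterInfo() needs to know which engine produced the data. Something like this (just a sketch):
<!--- Pick the scraping engine by component name (both expose getTheaterInfo()). --->
<cfset engineName = "Google" />
<cfset scraper = createObject( "component", engineName ).init() />
<!--- The returned struct (title, movies array) has the same shape either way. --->
<cfset theaterInfo = scraper.getTheaterInfo( "10dd19bd6f57c7c8" ) />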
When we run the above code, we get the following page output:
NOTE: Movie data removed at the request of data owner.
The ColdFusion component that powers this is somewhat less complex than the Fandango one because all the movies are listed on one page. In the Fandango version, I had to make several CFHTTP page requests to gather all of the showtime information; but on Google, it's all right there. Of course, this time, I have to rely on Regular Expression pattern matching rather than XPath; but it's not too much more complex.
Google.cfc
<cfcomponent
output="false"
hint="I help screen scrape the Google Movie showtimes.">
<cffunction
name="init"
access="public"
returntype="any"
output="false"
hint="I return an initialized component.">
<!--- Define arguments. --->
<cfargument
name="baseURL"
type="string"
required="false"
default="http://google.com/movies"
hint="I am the base URL for the HTTP requests."
/>
<!--- Store properties. --->
<cfset this.baseURL = arguments.baseURL />
<!--- Return this object reference. --->
<cfreturn this />
</cffunction>
<cffunction
name="getTheaterInfo"
access="public"
returntype="struct"
output="false"
hint="I parse the showtimes for the given theater ID.">
<!--- Define arguments. --->
<cfargument
name="theaterID"
type="string"
required="true"
hint="I am the theater ID used by Fandango."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Define the theater structure. --->
<cfset local.theaterInfo = {
id = arguments.theaterID,
title = "",
movies = []
} />
<!---
Grab the HTML off of the Google web page. With Google,
you typically have to send some sort of User Agent
because it will block a lot of user agents that it
considers "bots."
--->
<cfhttp
result="local.googleGet"
method="get"
url="#this.baseURL#?tid=#arguments.theaterID#"
useragent="Mozilla/BenNadel.com"
/>
<!---
While the HTML of the Google page is horrendously
incomplete, it is thankfully well-classed enough to
make string parsing somewhat straightforward.
--->
<!--- Grab the theater title div. --->
<cfset local.theaterDiv = reMatch(
"<div class=theater>[\w\W]+?</span>",
local.googleGet.fileContent
) />
<!---
Get the theater title by stripping out all tags from
the theater DIV. There is an H2 in there somewhere that
has our theater name.
--->
<cfset local.theaterInfo.title = trim(
reReplace(
local.theaterDiv[ 1 ],
"( |</?\w+[^>]*>)",
" ",
"all"
)
) />
<!--- Each movie is wrapped in a "movie" DIV that we can
extract with some regular expression matching.
--->
<cfset local.movieDivs = reMatch(
"<div class=movie>(?:\s|<(\w+)[^>]*>.+?</\1>)+",
local.googleGet.fileContent
) />
<!---
At this point, we have chunks of strings that contain
the movie data. Now, we have to loop over each one and
parse the details.
--->
<cfloop
index="local.movieDiv"
array="#local.movieDivs#">
<!--- Parse out the movie name DIV. --->
<cfset local.nameDiv = reMatch(
"<div class=name>.+?</div>",
local.movieDiv
) />
<!--- Parse out the showtimes DIV. --->
<cfset local.showtimesDiv = reMatch(
"<div class=times>.+?</div>",
local.movieDiv
) />
<!---
Create a movie struct from the parsed DIVs. For
this, we are basically going to take the parsed
DIVs and strip out all tags, leaving just the
textual data.
--->
<cfset local.movie = {
title = trim(
reReplace(
local.nameDiv[ 1 ],
"</?\w+[^>]*>",
" ",
"all"
)
),
showtimes = listToArray(
reReplace(
local.showtimesDiv[ 1 ],
"( |</?\w+[^>]*>)",
" ",
"all"
),
" "
)
} />
<!--- Append the movie to the ongoing collection. --->
<cfset arrayAppend(
local.theaterInfo.movies,
local.movie
) />
</cfloop>
<!--- Return the result. --->
<cfreturn local.theaterInfo />
</cffunction>
</cfcomponent>
As you can see, this version of the showtimes screen scraper relies entirely on reMatch() rather than xmlSearch(). But just because this version approaches the problem in a different way doesn't mean it is any less susceptible to problems. In either case, we are still depending on the predictable structure of a 3rd party page that we do not control. If that structure changes without notice, whether we use XML parsing or string pattern matching, our code might very well break.
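One small thing that can soften the blow - and this is just a sketch, not something the component above actually does - is to fail loudly when an expected markup hook goes missing, rather than letting an obscure array-index error bubble up:
<!--- Hypothetical guard, placed right after the theater DIV parse step. --->
<cfif NOT arrayLen( local.theaterDiv )>
	<cfthrow
		type="Google.MarkupChanged"
		message="The theater DIV could not be found - Google may have changed their markup."
		/>
</cfif>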
In the long term, Google's markup, while significantly incomplete, seems to be easier to work with simply because it's all on one page and has better CSS class hooks (for pattern matching). If I am gonna play around more with screen scraping movie showtimes, I'll probably be using this service to do so.
Want to use code from this post? Check out the license.
Reader Comments
Ben,
I have been trying to figure out an approach to screen-scraping for the website we build to aggregate event information around the state. Trying to get everyone to update their information is always an issue.
I'm not too swift when it comes to the whole issue of screen-scraping, then moving this data to a database. I've had limited success and it is different for each site.
What I would like to do is pull down event information from numerous sites, dump the information into a database, then upload it to our site. Is there an approach to this that is feasible? Am I just thinking too much and not working hard enough?
thanks.
@Scott,
Screen-scraping is never the "right" solution; however, sometimes it is the only solution available. When I was working on Skin-Spider waaaay back in the day (it screen-scraped adult content), what I did was create a uniform CFC interface for the concept of screen scraping. Then, I created a separate CFC for each target website that upheld the "scraping interface," but internally was set up specifically for that site (based on its HTML and whatnot).
It's not an easy approach, for sure; and, it is likely to break if / whenever they change the markup. But, if it's all you've got, abstracting it out into individual CFCs is really beneficial.
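As a rough illustration of what I mean - the names here are made up for the example, not taken from Skin-Spider - the shared contract could be as small as a one-method ColdFusion interface that each site-specific CFC implements:
<cfinterface
	hint="I define a uniform screen-scraping API (illustrative sketch only).">
	<cffunction
		name="getEventInfo"
		access="public"
		returntype="struct"
		output="false"
		hint="I return normalized event data scraped from the given source page.">
		<cfargument
			name="pageURL"
			type="string"
			required="true"
			/>
	</cffunction>
</cfinterface>
Each site's CFC would implement that interface and keep its own regex / XPath mess private, so the code that loads events into your database never has to care which site it is talking to.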
Also, if you are really serious about this, it can be a godsend to run the HTML through an "XHTML cleaner" first so that you can actually use xmlSearch(). That's what I was doing with TagSoup a while back:
www.bennadel.com/blog/1723-Parsing-Invalid-HTML-Into-XML-Using-ColdFusion-Groovy-And-TagSoup.htm
This is a more complex solution since it uses Groovy to load the JAR, which is then used to clean the HTML, which is then handed off to ColdFusion / xmlSearch(). But, once you do that, you can treat the target HTML as if it were XML, which makes scraping MUCH easier.
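As a very rough sketch of that cleaning step - assuming here that the TagSoup JAR is already on the ColdFusion classpath, rather than being loaded through Groovy the way that post does it, and with dirtyHtml standing in for whatever CFHTTP returned - the idea is something like:
<!--- TagSoup's Parser is a SAX XMLReader that tolerates dirty HTML. --->
<cfset tagsoup = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init() />
<cfset inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
	createObject( "java", "java.io.StringReader" ).init( dirtyHtml )
	) />
<cfset saxSource = createObject( "java", "javax.xml.transform.sax.SAXSource" ).init(
	tagsoup,
	inputSource
	) />
<cfset writer = createObject( "java", "java.io.StringWriter" ).init() />
<cfset streamResult = createObject( "java", "javax.xml.transform.stream.StreamResult" ).init( writer ) />
<!--- The default (identity) transformer simply serializes the cleaned-up document. --->
<cfset factory = createObject( "java", "javax.xml.transform.TransformerFactory" ).newInstance() />
<cfset factory.newTransformer().transform( saxSource, streamResult ) />
<!--- The markup is now well-formed and can be parsed / queried as XML. Note that
	TagSoup adds the XHTML namespace, so XPath expressions may need to account for it. --->
<cfset xmlDoc = xmlParse( writer.toString() ) />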
@Scott,
You might also look into YQL (Yahoo Query Language). They have some serious support for screen-scraping that I think does all of the XML/XHTML cleaning for you. I haven't looked into it that much, though.
@Ben Nadel,
Ben, thanks for the info on screen-scraping. Not something I want to think about, but I may take a shot at a few of the sites to see how it all works. Thanks again.
@Scott,
No problem my man. If you hit any walls, drop a note here.
Hi there, I'm not sure what happened to your blog about the fandango site (I can't get to it any longer) so I'm posting this note on this blog entry.
Fandango seems to have changed their format drastically in the past few days. You may want to check it out and update and repost your blog about it.
regards,
Royce
@Royce,
Fandango sent me a "cease and desist" order. Apparently, my blog post violated the part of their Terms of Service that prevented me from "facilitating the unauthorized use" of their data. Oh well :)
Oh well, I guess it was too good to last. In the spirit of the internet, you should display the C&D letter where your old blog post was.
I've been looking for a reason to move to The Movie DB (http://api.themoviedb.org/2.1/) API, so I guess this is it.
Buh bye Fandango, I guess my sending folks your way to buy tickets wasn't good enough for ya!