Data Mining Document Text for Searching
Right now I am in the process of getting this site up and running, and one of the milestones in that journey is the site search. Currently, the search works for the database content of the web log and the snippets. More will come as the site evolves. One of the details of site search is the uploaded document search. Snippets, for example, can have an uploaded sample code file. For a search on the text of that file to return the Snippet itself, I have to be able to search the content of the document.
I have tried using Verity over the years, and frankly, it has always been more headache than it is worth. Now, granted, maybe I am not the best at setting it up, but it always seems to carry so much setup and usage cost. Not to mention that the CFMX7 version tends to crash our server. So right now, what I am trying to do is strip the text out of the document and store that in the database along with the file info and association info. So far, it has been working nicely. I can strip the text out of text documents, HTML, HTM, Word, Excel, etc. The one beast I am having trouble with right now is the PDF. The dreaded PDF. I think I am going to have to go third-party on this one, unless I can figure out a built-in Java way to extract text.
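To make the strip-and-store idea concrete, here is a minimal sketch in Python rather than the ColdFusion that this site runs on. Everything in it is an assumption made for illustration: the table name document_text, the column names, the function names, and the choice of SQLite are all hypothetical, and it only handles plain text and HTML (Word, Excel, and PDF would each need their own extractor).

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the bodies of script and style tags."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(filename, raw):
    """Return searchable text for a file; HTML is stripped of its markup."""
    if filename.lower().endswith((".html", ".htm")):
        parser = TextExtractor()
        parser.feed(raw)
        parser.close()
        raw = " ".join(parser.parts)
    # Collapse runs of whitespace so the stored text is compact.
    return " ".join(raw.split())


def open_index():
    """Create an in-memory database with a hypothetical document_text table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE document_text ("
        "snippet_id INTEGER, filename TEXT, content TEXT)"
    )
    return db


def index_document(db, snippet_id, filename, raw):
    """Store the extracted text next to the file info and its association."""
    db.execute(
        "INSERT INTO document_text VALUES (?, ?, ?)",
        (snippet_id, filename, extract_text(filename, raw)),
    )


def search(db, term):
    """Return the ids of snippets whose uploaded document contains the term.

    Note: % and _ in the term are not escaped, so they act as LIKE wildcards.
    """
    rows = db.execute(
        "SELECT DISTINCT snippet_id FROM document_text "
        "WHERE content LIKE ? ORDER BY snippet_id",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

With this layout, a search on the document text returns the id of the owning Snippet, which is what lets a hit inside an uploaded file surface the Snippet itself.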
Now, even though the content that I get out of the documents is not 100% spot on (some words get deleted, punctuation gets removed), I still keep the gist of the content, and frankly, I think that's good enough to search on.
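One way to see why a lossy extract can still be searchable is sketched below in Python. This is not how the site does it; it is an assumed approach in which the stored text and the search query are both run through the same normalization (lowercase, punctuation replaced by spaces), so missing punctuation stops mattering. The function names normalize and matches are hypothetical.

```python
import re


def normalize(text):
    """Lowercase the text and replace anything that is not a letter or digit with a space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def matches(query, content):
    """Return True when every word of the query appears in the content."""
    content_words = set(normalize(content).split())
    return all(word in content_words for word in normalize(query).split())
```

Because the comparison is word by word, a document that lost its commas or quotes during extraction still matches the same queries. A word that was dropped entirely during extraction will of course not match, which is the part of "good enough" that has to be accepted.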
Reader Comments
Hi Ben,
I came across your article but found no solution. I am attempting to perform a screen scrape from an online PDF which is dynamically populated via a URL.
For example:
http://www.domain.com/test.pdf?var=1
I'd like to pull that data into a variable and perform a screen scrape. That would be the best. Any ideas on the best method for this?
This paper provides the reader with a very brief introduction to some of the theory and methods of text data mining. The intent of this article is to introduce the reader to some of the current methodologies that are employed within this discipline area while at the same time making the reader aware of some of the interesting challenges that remain to be solved within the area. Finally, the article serves as a very rudimentary tutorial on some of the techniques while also providing the reader with a list of references for additional study.
http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdfview_1&handle=euclid.ssu/1216238228