Tuesday, August 6, 2013

How to crawl PDF documents using Nutch 1.6?

I'm using Apache Nutch 1.6. My requirement is to crawl PDF documents as
.pdf files, but I can't get the text of a PDF file extracted. In my
nutch-site.xml I set only http.agent.name, http.robots.name and
http.proxy.host. Is there anything else I should add? In my plugins I have
only parse-tika. Is there anything else to add there? If so, please point
me to a link.
I can crawl .html pages, but for .pdf files there is no parse text.
Error:
parse.ParseUtil - Unable to successfully parse content http://nutch.apache.org/mailing_lists.pdf of type application/pdf
parse.ParseSegment - Error parsing: http://nutch.apache.org/mailing_lists.pdf: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
Thanks in advance.
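
For reference, here is a minimal sketch of a nutch-site.xml of the kind described above, with one extra property that may be worth checking. It rests on an assumption: this error is sometimes seen when a PDF is cut off at the default http.content.limit of 65536 bytes, so that Tika receives an incomplete file. Whether that is the cause here is not confirmed by the log. The agent name is a placeholder, and the plugin.includes value is only an example that keeps parse-tika enabled.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Placeholder agent name; use your own crawler's name. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>

  <!-- Example plugin list (an assumption, not the poster's actual value);
       the relevant part is that parse-tika is included. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Assumed fix: a negative value disables truncation of fetched content,
       so large PDFs reach the parser whole. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

After changing the limit, the PDF has to be fetched again before it is parsed, because the segment already on disk still holds the truncated copy.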
