ALAN: How to crawl PDF documents using Nutch 1.6?

Tuesday, August 6, 2013

How to crawl PDF documents using Nutch 1.6?

I'm using Apache Nutch 1.6. My requirement is to crawl PDF documents as
the .pdf files themselves, but I cannot even get the text of a PDF out of
the crawl. In my nutch-site.xml I set only http.agent.name,
http.robots.name and http.proxy.host. Is there anything else I should add?
Among the parse plugins I have only parse-tika. Is there another plugin I
need to add? If so, please point me to a link.
I can crawl .html pages, but for .pdf files no parse text is produced.
Error:
parse.ParseUtil - Unable to successfully parse content http://nutch.apache.org/mailing_lists.pdf of type application/pdf
parse.ParseSegment - Error parsing: http://nutch.apache.org/mailing_lists.pdf: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
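
For reference, the setup described above corresponds roughly to the nutch-site.xml sketched here. All values are placeholders, not the ones actually in use. The last property is not part of the original setup: it is an assumption worth testing, since Nutch's default http.content.limit is 65536 bytes and a PDF cut off at that size may fail to parse. Whether that is the cause of this particular error is not confirmed.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: illustrative sketch, every value is a placeholder -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <!-- The post calls this http.robots.name; nutch-default.xml names it http.robots.agents -->
    <name>http.robots.agents</name>
    <value>MyNutchCrawler,*</value>
  </property>
  <property>
    <name>http.proxy.host</name>
    <value>proxy.example.com</value>
  </property>
  <!-- Assumption, not in the original setup: lift the download size cap
       so that large PDFs are fetched whole before parse-tika sees them -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

A value of -1 disables the limit entirely; a large positive byte count would be a more conservative choice on a broad crawl.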
Thanks in advance....

ALAN
