java apache meta-tags nutch

Fetch particular tags from HTML docs obtained after crawling and parsing using Apache Nutch 1.4


I used Nutch 1.4 to crawl a website. The crawl succeeded and all the pages were dumped into segments. I merged the segments into a single segment and then used the readseg command to obtain a text version of all the crawled pages. Now I need to extract, for each page, its URL and the metadata (meta tags) stored in that page. I don't know which command to use, or whether I need to do something different.

I have searched Google quite a lot. Some people said that you have to write a separate plugin for this. Can someone please tell me how to do it?

Thanks a lot :) :)


Solution

  • I finally got it working. Sharing in case someone else needs it: you can use the index-metatags plugin described here: http://wiki.apache.org/nutch/IndexMetatags

    It should solve this problem. Cheers :)
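
    For reference, here is a rough sketch of what the configuration in conf/nutch-site.xml can look like. This is an assumption based on how later Nutch 1.x releases ship this feature (the parse-metatags and index-metadata plugins with the metatags.names and index.parse.md properties); on 1.4 the plugin may need to be patched in and the names may differ, so check the wiki page above for your version. The plugin.includes value and the chosen tags (description, keywords) are only illustrative.

```xml
<!-- Sketch only: property and plugin names assume a later Nutch 1.x release;
     verify them against the IndexMetatags wiki page for Nutch 1.4. -->
<configuration>
  <!-- Enable the meta tag parser and the metadata indexer (illustrative plugin list) -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Which meta tags to extract; they are stored in the parse metadata with a "metatag." prefix -->
  <property>
    <name>metatags.names</name>
    <value>description,keywords</value>
  </property>

  <!-- Which parse metadata entries to copy into the index -->
  <property>
    <name>index.parse.md</name>
    <value>metatag.description,metatag.keywords</value>
  </property>
</configuration>
```

    After re-parsing and re-indexing, the selected meta tags should show up as fields next to each document's URL.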