I recently started looking at Apache Nutch. I was able to set it up and crawl the web pages I am interested in. What I don't quite understand is how to read the crawled data. I basically want to associate the data of each page with some metadata (some random data for now) and store it locally, to be used later for (semantic) searching. Do I need to use Solr or Lucene for this? I am new to all of these. As far as I know, Nutch is used to crawl web pages. Can it also do additional things like adding metadata to the crawled data?
Useful commands.
Begin crawl
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
Get statistics of the crawled URLs
bin/nutch readdb crawl/crawldb -stats
Read a segment (gets all the data from the crawled pages). A programmatic way to read segment data is sketched after this list of commands.
bin/nutch readseg -dump crawl/segments/* segmentAllContent
Read segment (gets only the text field)
bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata
Get the list of known links to each URL, including both the source URL and the anchor text of the link.
bin/nutch readlinkdb crawl/linkdb/ -dump linkContent
Get all crawled URLs. Also gives other information such as whether a URL was fetched, the fetch time, the modified time, etc.
bin/nutch readdb crawl/crawldb/ -dump crawlContent
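The readseg dumps above produce plain-text output, but the segment data can also be read programmatically. Below is a rough, untested sketch of how a segment's fetched content might be read with the Hadoop SequenceFile API, assuming Nutch 1.x on Hadoop 2.x; the path crawl/segments/<segment>/content/part-00000/data and the class name SegmentContentReader are only illustrative and may differ in your setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;

public class SegmentContentReader {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Path to a segment's content data file, e.g. (illustrative):
    // crawl/segments/<segment-timestamp>/content/part-00000/data
    Path data = new Path(args[0]);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(data));
    Text url = new Text();           // key: the page URL
    Content content = new Content(); // value: the fetched content
    while (reader.next(url, content)) {
      System.out.println(url + "\t" + content.getContentType()
          + "\t" + content.getContent().length + " bytes");
    }
    reader.close();
  }
}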
For the second part, i.e. to add a new field, I am planning to use the index-extra plugin or to write a custom plugin; a rough sketch of a custom indexing filter is below.
Refer:
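For the metadata part of the question: in Nutch 1.x an indexing filter plugin can add extra fields to each document before it is sent to the index. The sketch below is not the index-extra plugin itself, just a rough illustration of a custom IndexingFilter; the class name AddMetadataFilter, the field name mymetadata, and the hard-coded value are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class AddMetadataFilter implements IndexingFilter {

  private Configuration conf;

  // Called for every document that gets indexed; attach the extra field here.
  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // "mymetadata" is a made-up field name; the value could instead come from
    // the parse metadata or the configuration rather than being hard-coded.
    doc.add("mymetadata", "some random data for now");
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

The plugin would still need its own plugin.xml and build entry, has to be enabled via plugin.includes in nutch-site.xml, and the new field has to be known to the search backend (for example added to the Solr schema) before it shows up in search results.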