Tags: solr, lucene, web-crawler, semantic-web, nutch

Nutch: Data read and adding metadata


I recently started looking into Apache Nutch. I have done the setup and am able to crawl web pages of interest with Nutch. What I don't quite understand is how to read this data. I basically want to associate the data of each page with some metadata (some random data for now) and store it locally, to be used later for (semantic) searching. Do I need to use Solr or Lucene for this? I am new to all of these tools. As far as I know, Nutch is used to crawl web pages. Can it also do additional things, like adding metadata to the crawled data?


Solution

  • Useful commands.

    Begin crawl

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    

    Get statistics of crawled URLs

    bin/nutch readdb crawl/crawldb -stats
    

    Read segment (gets all the data from web pages)

    bin/nutch readseg -dump crawl/segments/* segmentAllContent
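
    The asker's goal of attaching local metadata can be prototyped on top of this dump. Below is a minimal sketch, assuming the dump file uses Nutch 1.x's `Recno::` record separator and `URL::` lines (verify against your own dump before relying on it); the metadata values are placeholders.

    ```python
    import json
    import re

    def parse_segment_dump(text):
        """Split a readseg -dump file into records keyed by URL.

        Assumes each record starts with a 'Recno:: N' line and contains
        a 'URL:: <url>' line (Nutch 1.x dump format -- check yours).
        """
        records = []
        for chunk in re.split(r"^Recno::", text, flags=re.MULTILINE):
            m = re.search(r"^URL:: (\S+)", chunk, flags=re.MULTILINE)
            if m:
                records.append({"url": m.group(1), "raw": chunk.strip()})
        return records

    def attach_metadata(records):
        # Placeholder metadata, standing in for the "random data for now".
        for i, rec in enumerate(records):
            rec["metadata"] = {"doc_id": i, "tag": "unlabelled"}
        return records

    def store_locally(records, path):
        # Store the annotated pages locally as JSON for later searching.
        with open(path, "w") as f:
            json.dump(records, f, indent=2)

    # Usage (paths from the command above):
    #   recs = attach_metadata(parse_segment_dump(open("segmentAllContent/dump").read()))
    #   store_locally(recs, "pages_with_metadata.json")
    ```

    This keeps the metadata outside Nutch entirely; Solr/Lucene only become necessary once you want indexed search over the result.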
    

    Read segment (gets only the text field)

    bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata
    

    Get the list of known links to each URL, including both the source URL and the anchor text of each link.

    bin/nutch readlinkdb crawl/linkdb/ -dump linkContent
    

    Get all crawled URLs, along with other information such as whether each was fetched, the fetch time, the modified time, etc.

    bin/nutch readdb crawl/crawldb/ -dump crawlContent
    

    For the second part, i.e. adding a new field, I am planning to use the index-extra plugin or to write a custom plugin.
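
    As a rough illustration of the plugin route: stock Nutch 1.x ships an index-metadata plugin that copies named metadata keys into the index (the plugin and property names below are from Nutch 1.x and should be verified against your version; `myCustomField` is a hypothetical field name). Enabling it in conf/nutch-site.xml looks roughly like:

    ```xml
    <!-- conf/nutch-site.xml: sketch only, verify plugin/property names for your Nutch version -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    <property>
      <name>index.parse.md</name>
      <!-- comma-separated parse-metadata keys to copy into indexed documents -->
      <value>myCustomField</value>
    </property>
    ```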

    Refer:

    this and this