Tags: solr, lucene, web-crawler, semantic-web, nutch

Nutch: Data read and adding metadata


I recently started looking into Apache Nutch. I have done the setup and am able to crawl web pages of interest with Nutch. What I don't quite understand is how to read this data. I basically want to associate the data of each page with some metadata (some random data for now) and store it locally, to be used later for (semantic) searching. Do I need to use Solr or Lucene for this? I am new to all of these tools. As far as I know, Nutch is used to crawl web pages. Can it also do additional things, like adding metadata to the crawled data?


Solution

  • Useful commands.

    Begin crawl

    bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    

    Get statistics of crawled URLs

    bin/nutch readdb crawl/crawldb -stats
    

    Read segment (gets all the data from web pages)

    bin/nutch readseg -dump crawl/segments/* segmentAllContent
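
    The asker's goal of attaching local metadata can be prototyped on top of this dump. Below is a minimal sketch, assuming the dump file uses Nutch 1.x's `Recno::` record separator and `URL::` lines (verify against your own dump before relying on it); the metadata values are placeholders.

    ```python
    import json
    import re

    def parse_segment_dump(text):
        """Split a readseg -dump file into records keyed by URL.

        Assumes each record starts with a 'Recno:: N' line and contains
        a 'URL:: <url>' line (Nutch 1.x dump format -- check yours).
        """
        records = []
        for chunk in re.split(r"^Recno::", text, flags=re.MULTILINE):
            m = re.search(r"^URL:: (\S+)", chunk, flags=re.MULTILINE)
            if m:
                records.append({"url": m.group(1), "raw": chunk.strip()})
        return records

    def attach_metadata(records):
        # Placeholder metadata, standing in for the "random data for now".
        for i, rec in enumerate(records):
            rec["metadata"] = {"doc_id": i, "tag": "unlabelled"}
        return records

    def store_locally(records, path):
        # Store the annotated pages locally as JSON for later searching.
        with open(path, "w") as f:
            json.dump(records, f, indent=2)

    # Usage (paths from the command above):
    #   recs = attach_metadata(parse_segment_dump(open("segmentAllContent/dump").read()))
    #   store_locally(recs, "pages_with_metadata.json")
    ```

    This keeps the metadata outside Nutch entirely; Solr/Lucene only become necessary once you want indexed search over the result.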
    

    Read segment (gets only the text field)

    bin/nutch readseg -dump crawl/segments/* segmentTextContent -nocontent -nofetch -nogenerate -noparse -noparsedata
    

    Get the list of known links to each URL, including both the source URL and the anchor text of each link.

    bin/nutch readlinkdb crawl/linkdb/ -dump linkContent
    

    Get all crawled URLs, along with other information such as whether each was fetched, the fetch time, the modified time, etc.

    bin/nutch readdb crawl/crawldb/ -dump crawlContent
    

    For the second part, i.e. adding a new field, I am planning to use the index-extra plugin or to write a custom plugin.
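
    As a rough illustration of the plugin route: stock Nutch 1.x ships an index-metadata plugin that copies named metadata keys into the index (the plugin and property names below are from Nutch 1.x and should be verified against your version; `myCustomField` is a hypothetical field name). Enabling it in conf/nutch-site.xml looks roughly like:

    ```xml
    <!-- conf/nutch-site.xml: sketch only, verify plugin/property names for your Nutch version -->
    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    </property>
    <property>
      <name>index.parse.md</name>
      <!-- comma-separated parse-metadata keys to copy into indexed documents -->
      <value>myCustomField</value>
    </property>
    ```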

    Refer:

    this and this