
Dump data from a Nutch crawl into multiple WARC files


I have crawled a list of websites using Nutch 1.12. I can dump the crawl data into separate HTML files by using:

./bin/nutch dump -segment crawl/segments/ -o outputDir nameOfDir

And into a single WARC file by using:

./bin/nutch warc crawl/warcs crawl/segments/nameOfSegment

But how can I dump the collected data into multiple WARC files, one for each webpage crawled?


Solution

  • After quite a few attempts, I found that

    ./bin/nutch commoncrawldump -outputDir nameOfOutputDir -segment crawl/segments/segmentDir -warc
    

    does exactly what I needed: a full dump of the segment into individual WARC files!
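
If the crawl produced several segments, the same command can be repeated for each one. The following is only a minimal sketch, assuming one output directory per segment is wanted; the function name `dump_all_segments`, the `NUTCH_BIN` variable, and the `warcOutput` directory are hypothetical names introduced for illustration, and only the `commoncrawldump` flags shown above are used.

```shell
#!/bin/sh
# Sketch: run commoncrawldump once per segment directory.
# NUTCH_BIN and dump_all_segments are illustrative names, not part of Nutch.
NUTCH_BIN="${NUTCH_BIN:-./bin/nutch}"

dump_all_segments() {
    segments_dir="$1"   # e.g. crawl/segments
    output_root="$2"    # e.g. warcOutput (hypothetical)
    for segment in "$segments_dir"/*/; do
        # Skip the unexpanded glob pattern when there are no segments.
        [ -d "$segment" ] || continue
        segment="${segment%/}"            # drop the trailing slash
        name="$(basename "$segment")"     # segment timestamp directory name
        mkdir -p "$output_root/$name"
        # Same flags as in the answer above; stop at the first failure.
        "$NUTCH_BIN" commoncrawldump -outputDir "$output_root/$name" \
            -segment "$segment" -warc || return 1
    done
}
```

Calling `dump_all_segments crawl/segments warcOutput` from the Nutch runtime directory would then leave the WARC files for each segment under `warcOutput/<segmentName>/`.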