java, web-crawler, nutch

How to read from Nutch segments without readseg command


I'm using Nutch to crawl some websites; specifically, I am crawling this site.

I have obtained five segments containing all the documents found (around 10,000 documents). Now I want to process the content of those documents without using the readseg command, that is, without dumping the segments to plain text.

For this, only the content subdirectory of each segment is useful to me (the tags and the content of each document).

I have realised that inside the content directory there are two more files: data and index. However, I haven't found any explanation of what they contain or how to read them in order to process the content. I have also found some pointers to this question, but I have not yet understood the idea behind the algorithm.
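From what I can tell, the data file is a Hadoop SequenceFile (with index being a small MapFile index into it for random lookups by key). This can be checked by reading the file header, which starts with the magic bytes SEQ, a version byte, and then the key and value class names. Here is a minimal, dependency-free sketch to inspect that header (a simplification: the class-name length prefix is a varint, which is a single byte only for names shorter than 128 characters, as is the case here; the file path is passed as a command-line argument):

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Peek at the header of a segment's content/part-XXXXX/data file without
// any Hadoop dependency. A SequenceFile header is: the magic bytes "SEQ",
// one version byte, then the key and value class names, each stored as a
// varint length followed by UTF-8 bytes.
public class PeekSegmentData {

    // Reads a length-prefixed string, assuming the varint length fits in
    // one byte (true for class names shorter than 128 characters).
    static String readShortString(DataInputStream in) throws IOException {
        int len = in.readByte();
        byte[] buf = new byte[len];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static String[] readHeader(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("not a SequenceFile");
        }
        int version = in.readByte();
        String keyClass = readShortString(in);   // e.g. org.apache.hadoop.io.Text
        String valueClass = readShortString(in); // e.g. org.apache.nutch.protocol.Content
        return new String[] { "v" + version, keyClass, valueClass };
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            for (String s : readHeader(in)) System.out.println(s);
        }
    }
}
```

On a Nutch content file this should print something like v6, org.apache.hadoop.io.Text and org.apache.nutch.protocol.Content, which tells you which classes are needed to deserialise the records.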

How is the content stored in a Nutch segment, and how can it be read? I have provided the collection website and the segments in case a short example is helpful (though it is not necessary).


Solution

  • What do you need to do with the content? You could, for instance, write a custom IndexWriter. It would be invoked during the indexing step and would give you access to the content. Alternatively, look at the dump command (org.apache.nutch.tools.FileDumper) and modify its code.

    BTW, 'Hadoop: The Definitive Guide' by Tom White has a nice chapter on the Nutch data structures.

    If you want to do further processing of the pages, such as NLP or classification, Behemoth can be used to convert Nutch segments into a 'neutral' data structure on HDFS, which can then be processed with various tools.
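    If you prefer to read the content directory yourself rather than go through an IndexWriter, a minimal sketch along these lines should work (assuming a Nutch 1.x segment and the Hadoop and Nutch jars on the classpath; the path to the data file is passed as a command-line argument):

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;
    import org.apache.nutch.protocol.Content;

    // Iterate over the raw fetched documents stored in a segment's content
    // directory: the data file is a SequenceFile of (Text url, Content) pairs.
    public class ReadSegmentContent {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // e.g. crawl/segments/20150101000000/content/part-00000/data
            Path data = new Path(args[0]);
            try (SequenceFile.Reader reader =
                    new SequenceFile.Reader(conf, SequenceFile.Reader.file(data))) {
                // Instantiate key/value types recorded in the file header.
                Writable key = (Writable)
                    ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable)
                    ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                    Content content = (Content) value;
                    System.out.println(key + "\t" + content.getContentType());
                    byte[] raw = content.getContent(); // raw page bytes (e.g. HTML)
                    // process raw here: parse the HTML, extract text, etc.
                }
            }
        }
    }
    ```

    This is essentially what the dump tool does internally, minus the writing to plain-text files.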