Parsing paragraphs into separate documents in Solr using a script

I would like to crawl through a list of sites using Nutch, then break up each document into paragraphs and send them to Solr for indexing.

I have been using the following script to automate the process of crawling/fetching/parsing/indexing:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/#/nutch -s ./urls/ Crawl 2

My idea is to attach a script in the middle of this workflow (probably the parsing stage of Nutch?) that would break up the paragraphs, like paragraphs.split(). How could I accomplish this?

Additionally, I need to add fields to each paragraph that show its numerical position in the document and which chapter it belongs to. The chapter is an h2 tag in the document.

Solution

Currently, there is no easy answer to your question; to accomplish this you need custom code. Specifically, Nutch has two plugins for parsing HTML: parse-html and parse-tika. Both focus on extracting text content rather than structured data from the HTML document.

You would need a custom parser plugin (HtmlParserPlugin) that handles the paragraph nodes in your HTML document in its own way, extracting both the content and the positional information.
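To make the extraction logic concrete, here is a minimal sketch of splitting a page into paragraphs while tracking each paragraph's position and the nearest preceding h2. Nutch plugins are written in Java and work on the parsed DOM, so this Python version only illustrates the logic and is not a Nutch plugin; the names ParagraphSplitter and split_paragraphs and the output keys position, chapter, and text are made up for the example.

```python
from html.parser import HTMLParser


class ParagraphSplitter(HTMLParser):
    """Collects <p> text with its position and the nearest preceding <h2>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # one dict per paragraph, in document order
        self._chapter = None   # text of the most recent <h2>, if any
        self._capture = None   # "h2", "p", or None when outside both
        self._buffer = []      # text pieces of the element being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "p"):
            # A new <h2>/<p> implicitly closes an unclosed one.
            self._flush()
            self._capture = tag

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._flush()

    def handle_data(self, data):
        if self._capture is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._capture is None:
            return
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if self._capture == "h2":
            self._chapter = text
        elif text:
            self.paragraphs.append({
                "position": len(self.paragraphs) + 1,  # 1-based, document-wide
                "chapter": self._chapter,
                "text": text,
            })
        self._capture = None
        self._buffer = []


def split_paragraphs(html):
    """Return a list of paragraph dicts for the given HTML string."""
    parser = ParagraphSplitter()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up a trailing paragraph that was never closed
    return parser.paragraphs


if __name__ == "__main__":
    sample = "<h2>Intro</h2><p>First one.</p><h2>Methods</h2><p>Second one.</p>"
    for paragraph in split_paragraphs(sample):
        print(paragraph)
```

Paragraphs that appear before the first h2 get a chapter of None in this sketch; whether to drop them or assign a default chapter is up to you.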

The other component you would need is a data model in Solr. Since you need to keep each paragraph's position within its source document, you also need to send this data in a form that is searchable in Solr, perhaps as nested documents (this really depends on how you plan to use the data).
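As a rough illustration of the nested-document option, the sketch below builds one parent document per page with one child per paragraph, using the _childDocuments_ key that Solr's JSON update format accepts. The function name to_nested_solr_doc and the field names (type_s, position_i, chapter_s, text_t) are assumptions for the example: they rely on dynamic-field suffixes that your schema may not define, and the Nutch indexer would not produce this structure without custom code.

```python
import json


def to_nested_solr_doc(url, paragraphs):
    """Build a parent page document with one child document per paragraph."""
    children = []
    for p in paragraphs:
        child = {
            "id": f"{url}#p{p['position']}",   # hypothetical child id scheme
            "type_s": "paragraph",
            "position_i": p["position"],
            "text_t": p["text"],
        }
        # Leave the chapter field out when the paragraph precedes any <h2>.
        if p.get("chapter") is not None:
            child["chapter_s"] = p["chapter"]
        children.append(child)
    return {"id": url, "type_s": "page", "_childDocuments_": children}


# Hypothetical input: paragraphs already extracted from one page.
sample_paragraphs = [
    {"position": 1, "chapter": None, "text": "Preface text."},
    {"position": 2, "chapter": "Intro", "text": "First chapter text."},
]
doc = to_nested_solr_doc("http://example.com/book", sample_paragraphs)

# The update handler takes a JSON array of documents as the request body.
payload = json.dumps([doc], indent=2)

if __name__ == "__main__":
    print(payload)
```

If you only ever search paragraphs on their own, a flat layout (one Solr document per paragraph carrying the page URL as an ordinary field) may be simpler than nesting.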

For instance, you may take a look at this plugin, which implements custom logic for extracting data from the HTML using arbitrary XPath expressions.