
Are there any Nutch plugins to parse the HTML body?


I'm using Nutch 2.2.1 with HBase 0.9 for storing data and Apache Solr for searching it. These are my basic indexed fields:

<float name="boost">0.10625245</float>
<str name="digest">5ef9408b2c4692d2c8c7ed24c1b38863</str>
<str name="id">org.wikipedia.it:https/wiki/1767</str>    
<str name="title">1767 - Wikipedia</str>
<date name="tstamp">2017-12-21T17:00:30.293Z</date>
<str name="url">https://it.wikipedia.org/wiki/1767</str>

I want to parse and store the content of the HTML body of the crawled web pages. Do I have to write a Nutch plugin to do this, or is there a configuration option that enables it? I can't find a solution on the Nutch site.


Solution

  • I would say that you're missing the content field. If you take a look at https://github.com/apache/nutch/blob/2.x/conf/solrindex-mapping.xml#L34, you'll see that content is one of the default fields.

    Use the bin/nutch parsechecker tool to check whether the content is being extracted for your URLs. Then use bin/nutch indexchecker to test whether the indexer is also producing the content field. Lastly, check your mappings.

    Keep in mind that the content field will hold the text extracted by the parser, not the raw HTML.
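
As a quick way to confirm which fields a document actually carries, here is a minimal Python sketch that lists the field names in a single Solr document. It reuses the fields posted in the question; the surrounding `<doc>` wrapper and the helper names `field_names` and `has_content` are assumptions made for illustration and are not part of Nutch or Solr.

```python
import xml.etree.ElementTree as ET

# Fields copied from the question. The enclosing <doc> element is an
# assumption about how the document appears in the XML response.
SAMPLE_DOC = """<doc>
  <float name="boost">0.10625245</float>
  <str name="digest">5ef9408b2c4692d2c8c7ed24c1b38863</str>
  <str name="id">org.wikipedia.it:https/wiki/1767</str>
  <str name="title">1767 - Wikipedia</str>
  <date name="tstamp">2017-12-21T17:00:30.293Z</date>
  <str name="url">https://it.wikipedia.org/wiki/1767</str>
</doc>"""


def field_names(doc_xml):
    """Return the set of field names found in one <doc> element (hypothetical helper)."""
    root = ET.fromstring(doc_xml)
    return {child.get("name") for child in root}


def has_content(doc_xml):
    """Report whether the document carries a 'content' field (hypothetical helper)."""
    return "content" in field_names(doc_xml)


if __name__ == "__main__":
    print(sorted(field_names(SAMPLE_DOC)))
    print("content field present:", has_content(SAMPLE_DOC))
```

For the document in the question this reports that no content field is present, which matches the diagnosis above. It only shows what is missing; the parser, indexer and mapping checks are still needed to find out why.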