
Apache Nutch one document for each item in RSS Feed


I'm trying to build an application with Apache Nutch that adds several documents to the DB, one for each item in an RSS feed.

From my understanding, when Nutch parses a feed it currently creates a single Solr document with all the content concatenated:

<item>
   <title>Comment 1</title>
   <link>http://www.link.com/a/#comment-2555842742</link>
   <description>document text1</description>
   <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">12321 Borland</dc:creator>
   <pubDate>Mon, 07 Mar 2016 06:48:35 -0000</pubDate>
</item>
<item>
   <title>Comment 2</title>
   <link>http://www.link.com/a/#comment-2555590727</link>
   <description>document text2</description>
   <dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">12321</dc:creator>
   <pubDate>Mon, 07 Mar 2016 00:48:34 -0000</pubDate>
</item>

Instead, I would like to return two ParseResult entries rather than one: one for each item in the feed.


Solution

  • By default, RSS feeds are parsed by the parse-tika plugin (see https://github.com/apache/nutch/blob/master/conf/parse-plugins.xml#L31-L34), which treats the links inside the feed as outlinks of the original feed URL. These outlinks are then stored for later fetching, parsing, etc. You can check this by running:

    $ bin/nutch parsechecker http://humanos.uci.cu/feed/
    

    The output should be something like:

    ...
    ---------
    Url
    ---------------
    
    http://humanos.uci.cu/feed/
    ---------
    ParseData
    ---------
    
    Version: 5
    Status: success(1,0)
    Title: humanOS
    Outlinks: 10
    ...
    

    This basically reports that 1 URL was successfully parsed and 10 outlinks were found.

    To get the output that you want, you need to use the feed plugin. First, enable it by adding feed to the plugin.includes property in your nutch-site.xml file.

    Once this is done, you still need to instruct Nutch to try the feed parser first (it uses the ROME library underneath). To do so, edit the conf/parse-plugins.xml file, find the entry <mimeType name="application/rss+xml">, and change it to:

    <mimeType name="application/rss+xml">
        <plugin id="feed" />
        <plugin id="parse-tika" />
    </mimeType>
    

    If you now run the parsechecker command again, the output will be different, and once you index into Solr/ES you should see more documents: one for the original feed plus one for each item in it.

    Keep in mind that the content field of these new documents will contain only the description extracted from the feed, which may be fairly incomplete.

    If you need more customized logic, the ParseResult class lets a single parse return several "subdocuments" (https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/parse/ParseResult.java#L30-L41).
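
To illustrate the "one document per item" idea, here is a minimal, standalone sketch that splits a feed into one entry per <item>, keyed by the item's link. It does not use Nutch or ROME and is not how the feed plugin is implemented; the class name FeedItemSplitter, its methods, and the choice to join title and description with a newline are all hypothetical and only meant to show the shape of the result a custom parser would need to produce.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Nutch): splits an RSS feed into
// one "document" per <item>, using only the JDK's XML parser.
class FeedItemSplitter {

    // Returns a map from each item's <link> to "title\ndescription",
    // preserving the order in which the items appear in the feed.
    static Map<String, String> split(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(rssXml)));

        Map<String, String> documents = new LinkedHashMap<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            String link = childText(item, "link");
            if (link.isEmpty()) {
                continue; // an item without a link has no URL to key on
            }
            documents.put(link,
                    childText(item, "title") + "\n" + childText(item, "description"));
        }
        return documents;
    }

    // Text of the first descendant element with the given tag, or "" if absent.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss version=\"2.0\"><channel><title>Comments</title>"
                + "<item><title>Comment 1</title>"
                + "<link>http://www.link.com/a/#comment-2555842742</link>"
                + "<description>document text1</description></item>"
                + "<item><title>Comment 2</title>"
                + "<link>http://www.link.com/a/#comment-2555590727</link>"
                + "<description>document text2</description></item>"
                + "</channel></rss>";

        // One output line per feed item.
        split(rss).forEach((url, text) ->
                System.out.println(url + " -> " + text.replace("\n", " | ")));
    }
}
```

In a real Nutch parser plugin, each of these entries would presumably become its own entry in the ParseResult, alongside the entry for the feed URL itself; check the ParseResult source linked above for the actual methods before relying on this shape.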