I'm trying to build an application with Apache Nutch that adds several documents to the DB, one for each item in an RSS feed.
From my understanding, when Nutch currently parses a feed, it creates a single Solr document with all the content concatenated:
<item>
<title>Comment 1</title>
<link>http://www.link.com/a/#comment-2555842742</link>
<description>document text1</description>
<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">12321 Borland</dc:creator>
<pubDate>Mon, 07 Mar 2016 06:48:35 -0000</pubDate>
</item>
<item>
<title>Comment 2</title>
<link>http://www.link.com/a/#comment-2555590727</link>
<description>document text2</description>
<dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">12321</dc:creator>
<pubDate>Mon, 07 Mar 2016 00:48:34 -0000</pubDate>
</item>
Instead, I would like to be able to return two ParseResults rather than only one: one for each item in the feed.
By default, RSS feeds are parsed by the parse-tika plugin (see https://github.com/apache/nutch/blob/master/conf/parse-plugins.xml#L31-L34), which identifies the links inside the RSS feed as outlinks of the original feed URL. These outlinks are then stored for later fetching, parsing, etc. You can check this by running the command:
$ bin/nutch parsechecker http://humanos.uci.cu/feed/
The output should be something like:
...
---------
Url
---------------
http://humanos.uci.cu/feed/
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title: humanOS
Outlinks: 10
...
This basically reports that 1 URL was successfully parsed and 10 outlinks were found.
To get the output that you want, you need to use the feed plugin. So first, activate the feed plugin in the plugin.includes property of your nutch-site.xml file.
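As a rough sketch, the override in nutch-site.xml might look like the following. The property name is plugin.includes and its value is a regular expression over plugin ids; every plugin listed here other than feed is only an assumed example of a typical setup, so keep whatever list you already have and simply add feed to it.

```xml
<!-- conf/nutch-site.xml (inside the <configuration> root element) -->
<!-- Sketch only: the plugins other than "feed" are an assumed example list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|feed|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```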
Once this is done, you still need to instruct Nutch to use the feed parser first (which uses the ROME library underneath). To accomplish this, edit the conf/parse-plugins.xml file, find the entry <mimeType name="application/rss+xml"> and change it to:
<mimeType name="application/rss+xml">
<plugin id="feed" />
<plugin id="parse-tika" />
</mimeType>
If you now run the parsechecker command again, the output will be different, and once you index into Solr/ES you should see more documents: one for the original feed plus one for each item in your feed.
Keep in mind that the content field of these new documents will contain only the description extracted from the feed, which could be fairly incomplete.
If you need to write more customized logic, the ParseResult class allows you to have "subdocuments" (https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/parse/ParseResult.java#L30-L41).