I am trying to crawl DBpedia with Apache Nutch 1.15, but I'm having problems parsing RDF files.
During the parsing phase, I only get this message:
**apache_nutch | Error parsing: http://dbpedia.org/data/Moscow.xml: failed(2,0): Can't retrieve Tika parser for mime-type application/rdf+xml**
Following this reference, I configured my parse-plugins.xml to handle application/rdf+xml like this:
<mimeType name="application/rdf+xml">
<plugin id="parse-tika" />
<plugin id="feed" />
</mimeType>
But still, the message persists.
Even when I use Any23, mapping the parse filter as
<alias name="any23-parserFilter"
extension-id="Any23Parser" />
and setting the parsers for the mime type as:
<mimeType name="application/rdf+xml">
<plugin id="parse-tika" />
<plugin id="feed" />
</mimeType>
The message still persists.
What am I missing here?
The Nutch any23 plugin targets embedded RDF (RDFa) and Microdata. Technically, it only implements HtmlParseFilter, which requires that the document has first been successfully parsed by a Parser implementation, so it cannot parse a standalone application/rdf+xml document on its own.
To extract RDFa, try this and you should see many extracted triples:
> bin/nutch parsechecker \
-Dany23.extractors=html-microdata,html-rdfa11 \
-Dplugin.includes='protocol-http|parse-html|any23' \
https://schema.org/NewsArticle
...
Any23-Triples=<https://schema.org/NewsArticle> <http://www.w3.org/ns/rdfa#usesVocabulary> <http://schema.org/> .
...
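
If you want the same settings to persist beyond a single parsechecker call, the two -D options above could instead go into conf/nutch-site.xml. The fragment below is only a sketch that mirrors that command. It assumes the properties sit inside the file's existing configuration element. A real crawl would normally keep further plugins in plugin.includes (URL filters, scoring, indexing), which are left out here.

```xml
<!-- Sketch only: same values as the -D options of the parsechecker command. -->
<!-- Assumption: a real crawl needs more plugins in plugin.includes than listed here. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|parse-html|any23</value>
</property>
<property>
  <name>any23.extractors</name>
  <value>html-microdata,html-rdfa11</value>
</property>
```

Values passed with -D on the command line should still take precedence over the file, so parsechecker remains a quick way to try other extractor combinations.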