
Can't crawl RDF Data with Apache Nutch


I am trying to crawl DBpedia with Apache Nutch 1.15, but I'm having problems parsing RDF files.

During the parsing phase, I only get this message:

**apache_nutch | Error parsing: http://dbpedia.org/data/Moscow.xml: failed(2,0): Can't retrieve Tika parser for mime-type application/rdf+xml**

Following this reference, I configured my parse-plugins.xml to handle application/rdf+xml like this:

<mimeType name="application/rdf+xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
</mimeType>

But still, the message persists.

Even when I use Any23, mapping the parse filter as

<alias name="any23-parserFilter"
        extension-id="Any23Parser" />

and setting the parsers for the mime type as:

<mimeType name="application/rdf+xml">
    <plugin id="parse-tika" />
    <plugin id="feed" />
</mimeType>

The message still persists.

What am I missing here?


Solution

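  • The Nutch any23 plugin targets embedded RDF (RDFa) and Microdata. Technically, it only implements HtmlParseFilter, which requires that the document has first been parsed successfully by a Parser implementation. A parse filter therefore never runs for a document whose parsing has already failed, as in the application/rdf+xml error above.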

    To extract RDFa, try this and you should see many extracted triples:

    > bin/nutch parsechecker \
       -Dany23.extractors=html-microdata,html-rdfa11 \
       -Dplugin.includes='protocol-http|parse-html|any23' \
      https://schema.org/NewsArticle
    ...
    Any23-Triples=<https://schema.org/NewsArticle> <http://www.w3.org/ns/rdfa#usesVocabulary> <http://schema.org/> .
    ...
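The dependency described in this answer, that a parse filter only sees documents a parser has already handled, can be modelled with a small toy sketch. This is not Nutch's real API: `html_parser`, `triple_filter`, `PARSERS`, `FILTERS` and `parse_document` are hypothetical names invented for illustration, and the "triple count" is a crude stand-in for what Any23 would extract.

```python
# Toy model only: none of these names exist in Nutch or Any23.
from typing import Callable, Dict, List

Parser = Callable[[str], dict]
ParseFilter = Callable[[str, dict], dict]


def html_parser(content: str) -> dict:
    """Stand-in for a parser plugin that handles text/html."""
    return {"text": content, "metadata": {}}


def triple_filter(content: str, parse: dict) -> dict:
    """Stand-in for a parse filter: it can only enrich an existing parse result."""
    parse["metadata"]["triples"] = content.count("property=")
    return parse


# Parsers are looked up by MIME type; filters are applied afterwards.
PARSERS: Dict[str, Parser] = {"text/html": html_parser}
FILTERS: List[ParseFilter] = [triple_filter]


def parse_document(mime_type: str, content: str) -> dict:
    parser = PARSERS.get(mime_type)
    if parser is None:
        # No parser for this MIME type, so the filters are never reached.
        return {"status": "failed", "reason": f"no parser for {mime_type}"}
    parse = parser(content)
    for parse_filter in FILTERS:
        parse = parse_filter(content, parse)
    return {"status": "success", **parse}


html_result = parse_document("text/html", '<span property="name">Moscow</span>')
rdf_result = parse_document("application/rdf+xml", "<rdf:RDF/>")

print(html_result["status"], html_result["metadata"]["triples"])
print(rdf_result["status"], rdf_result["reason"])
```

In this model, registering the filter does nothing for `application/rdf+xml`, because the lookup fails before any filter is called. That mirrors why adding the any23 parse filter alone would not make the "Can't retrieve Tika parser" message go away.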