Tags: solr, hbase, nutch, robots.txt, metatag

How do you configure Apache Nutch 2.3 to honour robots metatag?


I have Nutch 2.3 set up with HBase as the backend, and I run a crawl cycle that includes indexing to Solr and Solr deduplication.

I have recently noticed that the Solr index contains unwanted webpages.

In order to get Nutch to ignore these webpages I set the following metatag:

<meta name="robots" content="noindex,follow"> 

I have visited the official Apache Nutch website, which explains the following:

If you do not have permission to edit the /robots.txt file on your server, you can still tell robots not to index your pages or follow your links. The standard mechanism for this is the robots META tag

Searching the web for answers, I found recommendations to set Protocol.CHECK_ROBOTS or to set protocol.plugin.check.robots as a property in nutch-site.xml. Neither of these appears to work.
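
For reference, this is how such a property would be declared in conf/nutch-site.xml. I show it only to illustrate what I tried; as noted above, it made no difference to the behaviour:

<configuration>
  <property>
    <name>protocol.plugin.check.robots</name>
    <value>true</value>
  </property>
</configuration>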

At present, Nutch 2.3 ignores the noindex rule and therefore indexes the content to the external datastore, i.e. Solr.

The question is how do I configure Nutch 2.3 to honour robots metatags?

Also, if Nutch 2.3 was previously configured to ignore the robots metatag and indexed a webpage during an earlier crawl cycle, will future crawls remove that page from the Solr index, provided the robots metatag rules are now correct?


Solution

  • I've created a plugin to overcome the problem of Apache Nutch 2.3 NOT honouring the robots metatag rule noindex. The metarobots plugin forces Nutch to discard qualifying documents during indexing, which prevents them from being indexed to your external datastore, i.e. Solr. A rough sketch of the idea is included below.

    Please note: this plugin prevents the indexing of documents that contain the robots metatag rule noindex; it does NOT remove any documents that were previously indexed to your external datastore.

    Visit this link for instructions
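
    For anyone who cannot follow the link, here is a rough sketch of the idea: an indexing filter that drops any document whose raw content contains a robots noindex metatag. This is not the actual metarobots plugin; the class, package and the simple content-scanning approach are illustrative, and it assumes the Nutch 2.x IndexingFilter interface (filter(NutchDocument, String, WebPage)) and that WebPage#getContent() returns the raw page bytes.

      package org.example.nutch.metarobots; // hypothetical package

      import java.nio.ByteBuffer;
      import java.nio.charset.StandardCharsets;
      import java.util.Collection;
      import java.util.Collections;
      import java.util.regex.Pattern;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.nutch.indexer.IndexingException;
      import org.apache.nutch.indexer.IndexingFilter;
      import org.apache.nutch.indexer.NutchDocument;
      import org.apache.nutch.storage.WebPage;

      public class MetaRobotsIndexingFilter implements IndexingFilter {

        // Matches a robots metatag whose content contains "noindex"
        // (name attribute before content, as in the tag shown in the question).
        private static final Pattern NOINDEX = Pattern.compile(
            "<meta\\s+[^>]*name\\s*=\\s*[\"']robots[\"'][^>]*content\\s*=\\s*[\"'][^\"']*noindex",
            Pattern.CASE_INSENSITIVE);

        private Configuration conf;

        @Override
        public NutchDocument filter(NutchDocument doc, String url, WebPage page)
            throws IndexingException {
          ByteBuffer raw = page.getContent();
          if (raw != null) {
            byte[] bytes = new byte[raw.remaining()];
            raw.duplicate().get(bytes);
            String html = new String(bytes, StandardCharsets.UTF_8);
            if (NOINDEX.matcher(html).find()) {
              // Returning null discards the document from the indexing job,
              // so it never reaches Solr.
              return null;
            }
          }
          return doc;
        }

        @Override
        public Collection<WebPage.Field> getFields() {
          // Ask Nutch to load the raw content field for this filter.
          return Collections.singleton(WebPage.Field.CONTENT);
        }

        @Override
        public Configuration getConf() {
          return conf;
        }

        @Override
        public void setConf(Configuration conf) {
          this.conf = conf;
        }
      }

    As with any Nutch plugin, it also needs a plugin.xml descriptor registering the filter and an entry for the plugin in plugin.includes in nutch-site.xml before the indexing job will pick it up.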