Nutch crawler: Configure to accept only pages in English


How can I configure the Nutch crawler to crawl only English pages?

This is what I set in nutch-site.xml, but it does not work:

<property>
    <name>http.accept.language</name>
    <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
    <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group.
    </description>
</property>

Solution

  • The value you set, <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>, states a preference for English but still accepts any other language through the * wildcard (at the lower weight q=0.3). To request only English pages, set the value as below:

    <value>en-us,en-gb,en</value>
    

    To be safe, change the value in nutch-default.xml as well.
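
    For reference, the change can be sketched as a complete property entry. This is only an illustration that reuses the property name from the question; the shortened description text and the comments are my own. Keep in mind that Accept-Language only states a preference in the request, so a server may still answer with pages in another language.

    ```xml
    <!-- Goes inside the <configuration> element of nutch-site.xml -->
    <property>
        <name>http.accept.language</name>
        <!-- No "*" wildcard and no q-values: only English variants are requested -->
        <value>en-us,en-gb,en</value>
        <!-- Description shortened here for illustration -->
        <description>Value of the "Accept-Language" request header field.</description>
    </property>
    ```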

    Hope this helps

    -Le Quoc Do