Search code examples
apacheweb-crawlernutch

Nutch not crawling URLs except the one specified in seed.txt


I am using Apache Nutch 1.12 and the URLs I am trying to crawl is something like https://www.mywebsite.com/abc-def/ which is the only entry in my seed.txt file. Since I don't want any page to be crawl that doesn't have "abc-def" in the URL so I have put the following line in regex-urlfilter.txt :

+^https://www.mywebsite.com/abc-def/(.+)*$

When I try to run the following crawl command :

**/bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3**

It crawl and index just one seed.txt url and in 2nd iteration it just say:

Generator: starting at 2017-02-28 09:51:36

Generator: Selecting best-scoring urls due for fetch.

Generator: filtering: false

Generator: normalizing: true

Generator: topN: 50000

Generator: 0 records selected for fetching, exiting ...

Generate returned 1 (no new segments created)

Escaping loop: no more URLs to fetch now

When I change the regex-urlfilter.txt to allow everything(+.) it started indexing every URL on https://www.mywebsite.com which certainly I don't want.

If anyone happen to have the same problem, please share how you get past it.


Solution

  • Got that working after trying multiple things in last 2 days.Here is the solution:

    Since the website I was crawling was very heavy, the property in nutch-default.xml was truncating it to 65536 bytes(default).The links I wanted to crawl unfortunately didn't get included in the selected part and hence nutch wasn't crawling it.When I changed it to unlimited by putting the following values in nutch-site.xml it starts crawling my pages :

    <property>
      <name>http.content.limit</name>
      <value>-1</value>
      <description>The length limit for downloaded content using the http://
      protocol, in bytes. If this value is nonnegative (>=0), content longer
      than it will be truncated; otherwise, no truncation at all. Do not
      confuse this setting with the file.content.limit setting.
      </description>
    </property>