I am using Apache Nutch 1.12, and the URL I am trying to crawl is something like https://www.mywebsite.com/abc-def/, which is the only entry in my seed.txt file. Since I don't want any page that doesn't have "abc-def" in its URL to be crawled, I have put the following line in regex-urlfilter.txt:
+^https://www.mywebsite.com/abc-def/(.+)*$
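(For what it's worth, the regex itself does seem to accept pages under /abc-def/. The urlfilter-regex plugin is backed by java.util.regex, so a quick standalone sanity check against a couple of made-up URLs looks like this; the class name and sample URLs are just for illustration:)

```java
import java.util.regex.Pattern;

// Standalone sanity check of the filter pattern, outside of Nutch.
public class FilterCheck {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("^https://www.mywebsite.com/abc-def/(.+)*$");
        String[] urls = {
            "https://www.mywebsite.com/abc-def/",          // seed URL
            "https://www.mywebsite.com/abc-def/some-page", // child page, should be accepted
            "https://www.mywebsite.com/other/page"         // outside abc-def, should be rejected
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (p.matcher(url).matches() ? "accepted" : "rejected"));
        }
    }
}
```

(As an aside, `(.+)*` is redundant; `.*` matches the same set of URLs with less backtracking risk.)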
When I try to run the following crawl command:
**/bin/crawl -i -D solr.server.url=http://mysolr:3737/solr/coreName $NUTCH_HOME/urls/ $NUTCH_HOME/crawl 3**
It crawls and indexes just the one URL from seed.txt, and in the second iteration the generator just says:
Generator: starting at 2017-02-28 09:51:36
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...
Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now
When I changed regex-urlfilter.txt to allow everything (+.), it started indexing every URL on https://www.mywebsite.com, which is certainly not what I want.
If anyone has run into the same problem, please share how you got past it.
Got this working after trying multiple things over the last 2 days. Here is the solution:
Since the website I was crawling was very heavy, the http.content.limit property in nutch-default.xml was truncating the downloaded content to 65536 bytes (the default). The links I wanted to crawl unfortunately fell outside the truncated portion, so Nutch never discovered them. When I changed it to unlimited by overriding the property in nutch-site.xml with the values below, it started crawling my pages:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
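In case it helps anyone debugging something similar: a rough way to confirm that truncation is the culprit is to compare the raw page size against the 65536-byte default. The sketch below is not a Nutch tool, just a standalone check; it uses the placeholder URL from the question, so replace it with the real page:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Download a page and report its size, to see whether it exceeds
// the old 65536-byte http.content.limit default. Not part of Nutch.
public class ContentSizeCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder URL from the question; pass the real page as an argument.
        String pageUrl = args.length > 0 ? args[0] : "https://www.mywebsite.com/abc-def/";
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestProperty("User-Agent", "size-check");
        long total = 0;
        try (InputStream in = conn.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        System.out.println("Downloaded " + total + " bytes"
                + (total > 65536 ? " -> would have been truncated at the default limit" : ""));
    }
}
```

Also worth noting: with -1, Nutch will download responses of any size, so if unbounded downloads are a concern, a larger finite limit (for example a few megabytes) works just as well, as long as it comfortably covers the pages being crawled.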