Tags: web-crawler, nutch, stormcrawler

Apache Nutch Crawler - Crawl new injected URLs in existing table only


I have to crawl some URLs via Nutch. For this, I provide seed URLs every time, so they are injected into the same table each time. As time passes, the database grows, and the generate phase scans all URLs, which takes time. Is there a way to instruct Nutch to crawl only the newly injected URLs and not look at the old URLs in the table? Or is there a better approach for this?


Solution

    1. (assuming "table" stands for the WebTable used by Nutch 2.x to persist crawled web pages in one of the supported storage back-ends, e.g. HBase): the generator marks fetch lists with a batch ID; see the script bin/crawl for details on how batch IDs are used. A batch ID is an arbitrary but unique string, and should not be too long, as some storage back-ends impose length limits (see the gora-*-mapping.xml files). To skip the generate step, you could use any other tool to mark the freshly injected URLs with a custom batch ID and then call fetch, parse, updatedb, and index using this ID.
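The batch-ID approach above might look roughly like the sketch below. The exact sub-command syntax varies between Nutch 2.x releases, the batch ID value is a made-up placeholder, and the step that marks the injected rows is deliberately left as a comment (Nutch itself does not ship a tool for it), so treat this as an outline rather than a verified recipe:

```shell
# Hypothetical Nutch 2.x workflow: process only a custom batch ID
# instead of letting 'generate' scan the whole WebTable.

# Arbitrary but unique string (keep it short; some back-ends limit length)
BATCH_ID="injected-$(date +%Y%m%d%H%M%S)"

# Inject the new seed URLs into the WebTable
bin/nutch inject urls/seed.txt

# Instead of running 'bin/nutch generate', mark the freshly injected
# rows with $BATCH_ID using your own tool (e.g. a small script against
# the HBase table) -- this marking step is the part you must supply.

# Then run the remaining phases against that batch ID only
bin/nutch fetch $BATCH_ID
bin/nutch parse $BATCH_ID
bin/nutch updatedb $BATCH_ID
bin/nutch index $BATCH_ID
```

Because fetch, parse, updatedb, and index all operate only on rows carrying the given batch ID, the old URLs in the table are never touched.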

    2. (in case it's about Nutch 1.x) there is the tool freegen, which takes a list of URLs (a text file) and creates a segment from it. Then call fetch, parse, updatedb, and index, passing the path of the created segment as a parameter.
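For the Nutch 1.x route, a freegen-based run could be sketched as follows; the paths (urls.txt, crawl/crawldb, crawl/linkdb, crawl/segments) are placeholders for your own layout, and the indexing flags depend on your configured indexer, so adapt as needed:

```shell
# Hypothetical Nutch 1.x workflow: build a segment directly from a URL
# list with freegen, skipping the generate step over the whole CrawlDb.

# urls.txt contains one URL per line
bin/nutch freegen urls.txt crawl/segments

# Pick the segment freegen just created (the newest directory)
SEGMENT=crawl/segments/$(ls -t crawl/segments | head -1)

# Run the remaining phases on that segment only
bin/nutch fetch $SEGMENT
bin/nutch parse $SEGMENT
bin/nutch updatedb crawl/crawldb $SEGMENT
bin/nutch index crawl/crawldb -linkdb crawl/linkdb $SEGMENT
```

Since the segment contains only the URLs from the text file, none of the phases revisit the older entries in the CrawlDb.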