I'm using Apache Nutch to crawl about 7000 URLs with 6 cycles in EMR cluster programmatically (there are few custom map-reduce jobs in the middle of crawl). Versions are: nutch=v1.15 hadoop=2.7.3 I'm running it on Amazon EMR cluster with 20 EC2 m4.large spot instances. The code for crawling is:
public void crawl(final Folder seeds, final Folder output)
    throws IOException, InterruptedException {
    final Folder crawldb = output.folder("crawldb");
    try {
        new Injector(this.conf).inject(
            crawldb.path(), seeds.path(),
            true, true
        );
    } catch (final ClassNotFoundException err) {
        throw new IOException("Failed to inject URLs", err);
    }
    final Folder segments = output.mkdir("segments");
    // cycles = 6 in my case
    for (int idx = 0; idx < cycles; ++idx) {
        this.cycle(crawldb, segments);
    }
}
private void cycle(final Folder crawldb, final Folder segments)
    throws IOException, InterruptedException {
    try {
        Logger.info(this, "Generating...");
        // configured as 1_000_000 in EMR cluster
        final int topn = this.conf.getInt("yc.gen.topn", 1000);
        // configured as 40 (2 x slave_nodes) in EMR cluster
        final int nfetch = this.conf.getInt("yc.gen.nfetch", 1);
        new Generator(this.conf).generate(
            crawldb.path(),
            segments.path(),
            nfetch, topn, System.currentTimeMillis()
        );
        // the latest segment
        final Optional<Folder> next = Batch.nextSegment(segments);
        if (next.isPresent()) {
            final Path sgmt = next.get().path();
            Logger.info(this, "Fetching %s...", sgmt);
            new Fetcher(this.conf).fetch(
                // @checkstyle MagicNumber (1 line)
                sgmt, 10
            );
            Logger.info(this, "Parsing %s...", sgmt);
            new ParseSegment(this.conf).parse(sgmt);
        }
        new CrawlDb(this.conf).update(
            crawldb.path(),
            // all segments paths
            segments.subfolders().stream()
                .toArray(Path[]::new),
            true, true
        );
    } catch (final ClassNotFoundException err) {
        throw new IOException(
            "Failed to generate/fetch/parse segment", err
        );
    }
}
When I run it with 7000 seed URLs and 6 cycles, Nutch becomes very slow on the FetchData job: it runs for about 3 hours, and it seems to be waiting for the last mapper to complete for roughly the final 2.5 hours (see screenshots attached). What may be the problem with this job, and how can I speed up the FetchData phase? Maybe I can configure it to skip slow fetchers (missing a few URLs is not a big problem).
Nutch's generator job partitions the fetch list into queues by host (alternatively by domain, see partition.url.mode). All URLs of one fetch queue are processed in one fetcher map task to ensure politeness constraints: there is only a single connection to a given host at any time, and there are guaranteed delays between successive requests to the same host. The partitioning is also important for performance, because DNS resolution, robots.txt parsing and caching of the results can be done locally in a map task.
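The host-based queueing can be sketched in plain Java. This is a simplified illustration of the idea, not Nutch's actual URLPartitioner class: every URL of the same host hashes to the same map task, which is what makes single-connection politeness enforceable locally.

```java
import java.net.URI;

/**
 * Minimal sketch of host-based fetch-list partitioning (illustrative,
 * not Nutch's real URLPartitioner): all URLs of one host map to the
 * same fetcher task, so politeness delays can be enforced per task.
 */
public class HostPartitioner {
    public static int partitionByHost(final String url, final int numTasks) {
        final String host = URI.create(url).getHost();
        // same host -> same hash -> same task index
        return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
    }
}
```

The flip side of this design is exactly the problem observed above: one host with a very long queue (or slow responses) pins one map task while all others finish.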
If one or a few fetch queues are too long, or a few crawled hosts respond too slowly, these queues "block" the crawling progress. To overcome this issue there are three options, which can even be combined:

1. Limit the runtime of the fetch task with fetcher.timelimit.mins. If the time limit is hit, the remaining URLs from the fetch queues are skipped and fetched in the next cycle.
2. Limit the size of the fetch queues with generate.max.count and generate.count.mode, so that no single host (or domain) contributes too many URLs to one segment.
3. (only if you are allowed to crawl more aggressively) shorten the delay between requests to the same host (fetcher.server.delay) or even allow parallel connections to one host (fetcher.threads.per.queue).

There are more options to tune the performance of a crawl; all properties are documented in the file conf/nutch-default.xml. The default values are good to ensure completeness on a crawl restricted to a set of hosts/domains, and need to be changed to get a high throughput in a broad crawl where it is accepted that some hosts/domains cannot be crawled exhaustively.
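As a concrete starting point, the three options above could be set in conf/nutch-site.xml (or via the Hadoop Configuration object, since the crawl is driven programmatically). The values below are illustrative assumptions only and need tuning for the crawl at hand:

```xml
<!-- Illustrative values; tune for your crawl. -->
<property>
  <name>fetcher.timelimit.mins</name>
  <value>120</value>
  <!-- abort the fetch after 2 hours; skipped URLs are retried next cycle -->
</property>
<property>
  <name>generate.max.count</name>
  <value>500</value>
  <!-- at most 500 URLs per host/domain in one segment -->
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <!-- seconds between requests to the same host -->
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>2</value>
  <!-- parallel connections per host; raise only if politeness permits -->
</property>
```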