Tags: javascript, node.js, web-crawler, apify, crawlee

Crawlee scraper invoking the same handler multiple times


I've built a Crawlee scraper, but for some reason it invokes the same handler multiple times, creating a lot of duplicate requests and duplicate entries in my dataset. Also:

  • I've already tried manually setting a uniqueKey for every request (the request queue is supposed to deduplicate on it; see the sketch after this list).
  • I've also tried setting maxConcurrency: 1 for the crawler.
  • As you can see from the logs below, the issue is not that I'm adding the same requests multiple times; it's Crawlee itself that invokes the handlers again with requests it has already handled.
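
For reference, this is why I expected deduplication to kick in: the request queue treats uniqueKey as its deduplication key, so adding a request whose key it has already seen should be a no-op. A minimal standalone sketch of that expectation (it opens the default queue, the same one the crawler uses):

import { RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

const request = {
  uniqueKey: 'ROUTE_A_https://example.com/page-a/user-0',
  url: 'https://example.com/page-a/user-0',
};

// First add: the request is new.
const first = await queue.addRequest(request);
console.log(first.wasAlreadyPresent); // false

// Second add with the same uniqueKey: the queue should ignore it.
const second = await queue.addRequest(request);
console.log(second.wasAlreadyPresent); // true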

Here are the relevant (simplified) files:

main.ts:

import { Actor } from 'apify';
import { CheerioCrawler, Dataset, log } from 'crawlee';
import type { CrawlerAddRequestsOptions, RequestOptions, Source } from 'crawlee';
import { router, RouterHandlerLabels } from './router';

// `dataset`, the `ScrapperData` type and JSON_OUTPUT_FILE_KEY are defined elsewhere in the project (omitted here).

await Actor.init();

const crawler = new CheerioCrawler({
  requestHandler: router,
  sameDomainDelaySecs: 3,
  maxRequestRetries: 3,
  maxConcurrency: 1,
});

const originalAddRequestsFn = crawler.addRequests.bind(crawler);

crawler.addRequests = function(requests: Source[], options?: CrawlerAddRequestsOptions) {
  if (requests.length > 1) {
    log.info(`INITIAL REQUESTS = ${ requests.length }`);
  } else {
    log.info(`${ requests[0].label } | ${ requests[0].uniqueKey || '-' } = ${ requests[0].url }`);
  }

  return originalAddRequestsFn(requests, options);
}

const requestsOptions: RequestOptions<ScrapperData>[] = [{
  uniqueKey: `ROUTE_A_${ dataset[0].startURL }`,
  url: dataset[0].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 0 },
}, {
  uniqueKey: `ROUTE_A_${ dataset[1].startURL }`,
  url: dataset[1].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 1 },
}];

try {
  await crawler.run(requestsOptions);
  await Dataset.exportToJSON(JSON_OUTPUT_FILE_KEY);
} finally {
  await Actor.exit();
}

router.ts:

import { createCheerioRouter } from 'crawlee';

import { handlerA } from './handler-a';
import { handlerB } from './handler-b';
import { handlerC } from './handler-c';

export enum RouterHandlerLabels {
  ROUTE_A = 'route-a',
  ROUTE_B = 'route-b',
  ROUTE_C = 'route-c',
}

export const router = createCheerioRouter();

router.addHandler(RouterHandlerLabels.ROUTE_A, handlerA);
router.addHandler(RouterHandlerLabels.ROUTE_B, handlerB);
router.addHandler(RouterHandlerLabels.ROUTE_C, handlerC);

router.addDefaultHandler(async ({ log }) => {
  log.info('Default handler...');
});

handler-a.ts:

import type { CheerioCrawlingContext } from 'crawlee';
import { RouterHandlerLabels } from './router';

// DEFAULT_REQUEST_HEADERS, findLinkToB and the ScrapperData type are defined elsewhere in the project (omitted here).

export async function handlerA({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;

  log.info(`A. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToB(pageHTML);

  if (!nextURL) return;

  log.info('A. Call addRequests(...)');

  await crawler.addRequests([{
    uniqueKey: `ROUTE_B_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_B,
    userData: request.userData,
  }]);
}

handler-b.ts:

import type { CheerioCrawlingContext } from 'crawlee';
import { RouterHandlerLabels } from './router';

// DEFAULT_REQUEST_HEADERS, findLinkToC and the ScrapperData type are defined elsewhere in the project (omitted here).

export async function handlerB({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;

  log.info(`B. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToC(pageHTML);

  if (!nextURL) return;

  log.info('B. Call addRequests(...)');

  await crawler.addRequests([{
    uniqueKey: `ROUTE_C_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_C,
    userData: request.userData,
  }]);
}

handler-c.ts:

import type { CheerioCrawlingContext } from 'crawlee';

// findDataInPageC and the ScrapperData type are defined elsewhere in the project (omitted here).

export async function handlerC({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;

  log.info(`C. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);

  const pageHTML = $('body').html() || '';
  const extractedData = findDataInPageC(pageHTML);

  if (!extractedData) return;

  log.info(`C. Saving data for ${ datasetIndex }`);

  await pushData({ ...extractedData, datasetIndex });
}

These are the logs I get:

INFO  System info {"apifyVersion":"3.1.12","apifyClientVersion":"2.8.1","crawleeVersion":"3.5.8","osType":"Linux","nodeVersion":"v20.8.1"}
INFO  INITIAL REQUESTS = 2
INFO  CheerioCrawler: Starting the crawler.
INFO  CheerioCrawler: A. 0: https://example.com/page-a/user-0
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-0 = https://example.com/page-b/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO  CheerioCrawler: A. Call addRequests(...)
INFO  ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO  Statistics: CheerioCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5599,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":50388,"requestsTotal":9,"crawlerRuntimeMillis":61279,"retryHistogram":[9]}
INFO  CheerioCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.7,"actualRatio":0.858},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO  CheerioCrawler: C. Saving data for 0
INFO  CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO  CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO  CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO  CheerioCrawler: C. Saving data for 1
INFO  CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO  CheerioCrawler: B. Call addRequests(...)
INFO  ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO  CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO  CheerioCrawler: C. Saving data for 1
INFO  CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO  CheerioCrawler: C. Saving data for 1
INFO  CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  CheerioCrawler: Final request statistics: {"requestsFinished":19,"requestsFailed":0,"retryHistogram":[19],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5150,"requestsFinishedPerMinute":10,"requestsFailedPerMinute":0,"requestTotalDurationMillis":97844,"requestsTotal":19,"crawlerRuntimeMillis":115660}
INFO  CheerioCrawler: Finished! Total 19 requests: 19 succeeded, 0 failed. {"terminal":true}

In this case, it produced a total of 7 results: 4 for the first dataset entry and 3 for the second one, when it should be just one for each, so 2 results in total.

Line 13 of the logs is the first one that doesn't make sense:

INFO  CheerioCrawler: A. 1: https://example.com/page-a/user-1

At that point, both requests to page-a, one for user-0 and one for user-1, have already been handled (log lines 4 and 7, respectively).

I've also tried adding only one initial request (when calling crawler.run(...)), but some handlers still get invoked more than once for the same request.

I'm using crawlee 3.5.8.


Solution

  • OK, so I got some help from Apify on their Discord, and it turns out it's a known bug:

    This issue specifically arises when we utilize the sameDomainDelaySecs feature with [email protected]. Interestingly, we do not encounter this problem when using the same feature with [email protected]. Consequently, we suspect that this warning may be connected to this fix #2045.

    I've tried versions 3.5.2 and 3.5.0 and I still have the same issue, so I ended up removing sameDomainDelaySecs and adding an await sleep(delayInMs) before adding new requests.

    You can do that manually before each call to crawler.addRequests, or you can override crawler.addRequests so that it always waits a few seconds before adding new requests:

    import { sleep } from 'crawlee';
    
    const DELAY_IN_MS = 3_000; // my own constant, roughly matching the old sameDomainDelaySecs: 3
    
    const originalAddRequestsFn = crawler.addRequests.bind(crawler);
    
    crawler.addRequests = async function(
      requests: Source[],
      options?: CrawlerAddRequestsOptions,
    ) {
      // Pause before every batch so consecutive requests to the same domain are spaced out.
      await sleep(DELAY_IN_MS);
    
      return originalAddRequestsFn(requests, options);
    };
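
    For the manual variant it's just the same two steps inline (DELAY_IN_MS being my own constant, not a Crawlee option):

    await sleep(DELAY_IN_MS);
    
    await crawler.addRequests([{ /* ... */ }]);

    Since the crawler runs with maxConcurrency: 1, sleeping before every addRequests call effectively restores the 3-second spacing between same-domain requests that sameDomainDelaySecs was providing.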