I've built a Crawlee scraper, but for some reason it invokes the same handler multiple times, creating a lot of duplicate requests and entries in my dataset. Also, I'm already setting explicit uniqueKeys for all my requests and running the crawler with maxConcurrency: 1.
Here are the relevant (simplified) files:
main.ts:
await Actor.init();
const crawler = new CheerioCrawler({
  requestHandler: router,
  sameDomainDelaySecs: 3,
  maxRequestRetries: 3,
  maxConcurrency: 1,
});
const originalAddRequestsFn = crawler.addRequests.bind(crawler);
crawler.addRequests = function(requests: Source[], options: CrawlerAddRequestsOptions) {
  if (requests.length > 1) {
    log.info(`INITIAL REQUESTS = ${ requests.length }`);
  } else {
    log.info(`${ requests[0].label } | ${ requests[0].uniqueKey || '-' } = ${ requests[0].url }`);
  }
  return originalAddRequestsFn(requests, options);
};
const requestsOptions: RequestOptions<ScrapperData>[] = [{
  uniqueKey: `ROUTE_A_${ dataset[0].startURL }`,
  url: dataset[0].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 0 },
}, {
  uniqueKey: `ROUTE_A_${ dataset[1].startURL }`,
  url: dataset[1].startURL,
  label: RouterHandlerLabels.ROUTE_A,
  userData: { datasetIndex: 1 },
}];
try {
  await crawler.run(requestsOptions);
  await Dataset.exportToJSON(JSON_OUTPUT_FILE_KEY);
} finally {
  await Actor.exit();
}
router.ts:
export enum RouterHandlerLabels {
  ROUTE_A = 'route-a',
  ROUTE_B = 'route-b',
  ROUTE_C = 'route-c',
}
export const router = createCheerioRouter();
router.addHandler(RouterHandlerLabels.ROUTE_A, handlerA);
router.addHandler(RouterHandlerLabels.ROUTE_B, handlerB);
router.addHandler(RouterHandlerLabels.ROUTE_C, handlerC);
router.addDefaultHandler(async ({ log }) => {
  log.info('Default handler...');
});
handler-a.ts:
export async function handlerA({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;
  log.info(`A. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);
  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToB(pageHTML);
  if (!nextURL) return;
  log.info('A. Call addRequests(...)');
  await crawler.addRequests([{
    uniqueKey: `ROUTE_B_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_B,
    userData: request.userData,
  }]);
}
handler-b.ts:
export async function handlerB({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;
  log.info(`B. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);
  const pageHTML = $('body').html() || '';
  const nextURL = findLinkToC(pageHTML);
  if (!nextURL) return;
  log.info('B. Call addRequests(...)');
  await crawler.addRequests([{
    uniqueKey: `ROUTE_C_${ nextURL }`,
    url: nextURL,
    headers: DEFAULT_REQUEST_HEADERS,
    label: RouterHandlerLabels.ROUTE_C,
    userData: request.userData,
  }]);
}
handler-c.ts:
export async function handlerC({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;
  log.info(`C. ${ datasetIndex }: ${ request.loadedUrl || '?' }`);
  const pageHTML = $('body').html() || '';
  const extractedData = findDataInPageC(pageHTML);
  if (!extractedData) return;
  log.info(`C. Saving data for ${ datasetIndex }`);
  await pushData({ ...extractedData, datasetIndex });
}
These are the logs I get:
INFO System info {"apifyVersion":"3.1.12","apifyClientVersion":"2.8.1","crawleeVersion":"3.5.8","osType":"Linux","nodeVersion":"v20.8.1"}
INFO INITIAL REQUESTS = 2
INFO CheerioCrawler: Starting the crawler.
INFO CheerioCrawler: A. 0: https://example.com/page-a/user-0
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-0 = https://example.com/page-b/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: B. 0: https://example.com/page-b/user-0
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-0 = https://example.com/page-c/user-0
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
INFO CheerioCrawler: A. Call addRequests(...)
INFO ROUTE_B | ROUTE_B_https://example.com/page-b/user-1 = https://example.com/page-b/user-1
INFO Statistics: CheerioCrawler request statistics: {"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5599,"requestsFinishedPerMinute":9,"requestsFailedPerMinute":0,"requestTotalDurationMillis":50388,"requestsTotal":9,"crawlerRuntimeMillis":61279,"retryHistogram":[9]}
INFO CheerioCrawler:AutoscaledPool: state {"currentConcurrency":1,"desiredConcurrency":1,"systemStatus":{"isSystemIdle":false,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":0},"eventLoopInfo":{"isOverloaded":true,"limitRatio":0.7,"actualRatio":0.858},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: C. 0: https://example.com/page-c/user-0
INFO CheerioCrawler: C. Saving data for 0
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: B. 1: https://example.com/page-b/user-1
INFO CheerioCrawler: B. Call addRequests(...)
INFO ROUTE_C | ROUTE_C_https://example.com/page-c/user-1 = https://example.com/page-c/user-1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: C. 1: https://example.com/page-c/user-1
INFO CheerioCrawler: C. Saving data for 1
INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO CheerioCrawler: Final request statistics: {"requestsFinished":19,"requestsFailed":0,"retryHistogram":[19],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":5150,"requestsFinishedPerMinute":10,"requestsFailedPerMinute":0,"requestTotalDurationMillis":97844,"requestsTotal":19,"crawlerRuntimeMillis":115660}
INFO CheerioCrawler: Finished! Total 19 requests: 19 succeeded, 0 failed. {"terminal":true}
In this case, it produced a total of 7 results: 4 for the first dataset entry and 3 for the second one (it should actually be only one for each, so 2 results in total).
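The uniqueKeys themselves do their job at the queue level: adding a request whose uniqueKey the queue has already seen is reported as already present rather than enqueued a second time, so the duplicates above can't simply be explained by double enqueueing. A quick standalone check (a minimal sketch, independent of the crawler above):

import { log, RequestQueue } from 'crawlee';

const queue = await RequestQueue.open();

// Adding the same uniqueKey twice: the second call reports wasAlreadyPresent = true.
const first = await queue.addRequest({ url: 'https://example.com/page-b/user-0', uniqueKey: 'ROUTE_B_https://example.com/page-b/user-0' });
const second = await queue.addRequest({ url: 'https://example.com/page-b/user-0', uniqueKey: 'ROUTE_B_https://example.com/page-b/user-0' });

log.info(`first: ${ first.wasAlreadyPresent }, second: ${ second.wasAlreadyPresent }`); // false, then true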
Line 13 of the logs is the first one that doesn't make sense:
INFO CheerioCrawler: A. 1: https://example.com/page-a/user-1
At that point, both requests to page-a, one for user-0 and one for user-1, had already been handled (log lines 4 and 7, respectively).
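One way to confirm that it really is the same queued request being processed again (and not a fresh duplicate being enqueued) is to log the request's queue id and retry count inside the handlers; both id and retryCount are standard properties of Crawlee's Request object. A minimal sketch, using handler-a.ts as the example:

// handler-a.ts (debugging variant): identify repeated invocations of the same request
export async function handlerA({ request, $, pushData, log, crawler }: CheerioCrawlingContext<ScrapperData>) {
  const { datasetIndex } = request.userData;
  // request.id is assigned by the RequestQueue; if the same id keeps showing up
  // while retryCount stays at 0, the request is being re-processed, not retried.
  log.info(`A. ${ datasetIndex }: ${ request.loadedUrl || '?' } (id=${ request.id || '-' }, retries=${ request.retryCount })`);
  // ... rest of the handler unchanged
}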
I've also tried adding only one initial request (when calling crawler.run(...)), but some handlers still get invoked more than once for the same request.
I'm using crawlee 3.5.8.
Ok, so I got some help from Apify on their Discord and it's a known bug:
This issue specifically arises when we utilize the sameDomainDelaySecs feature with [email protected]. Interestingly, we do not encounter this problem when using the same feature with [email protected]. Consequently, we suspect that this warning may be connected to this fix #2045.
I've tried versions 3.5.2 and 3.5.0 and I still have the same issue, so I ended up removing sameDomainDelaySecs and adding an await sleep(delayInMs) before adding new requests.
You can do that manually before calling crawler.addRequests, or you can override crawler.addRequests so that it always waits a few seconds before adding new ones:
const originalAddRequestsFn = crawler.addRequests.bind(crawler);

crawler.addRequests = async function(
  requests: Source[],
  options: CrawlerAddRequestsOptions,
) {
  await sleep(DELAY_IN_MS);
  return originalAddRequestsFn(requests, options);
};
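For the manual variant, the same delay can simply be awaited inside a handler right before enqueueing the next request. A minimal sketch based on handler-a.ts above (sleep is exported by crawlee; DELAY_IN_MS is an assumed constant, e.g. 3000 ms to roughly mimic sameDomainDelaySecs: 3):

import { sleep } from 'crawlee';

// inside handlerA, instead of calling crawler.addRequests(...) directly:
await sleep(DELAY_IN_MS); // assumed delay constant, roughly equivalent to sameDomainDelaySecs: 3
await crawler.addRequests([{
  uniqueKey: `ROUTE_B_${ nextURL }`,
  url: nextURL,
  label: RouterHandlerLabels.ROUTE_B,
  userData: request.userData,
}]);

Either way the effect is the same: follow-up requests are spaced out without relying on sameDomainDelaySecs, which avoids the re-processing behaviour described above.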