So I'm trying to create a data scraper in Node.js using the Request module. I'd like to limit the concurrency to 1 domain on a 20ms cycle to go through 50,000 urls.
When I execute the code, I'm DoS-ing the network with the 40 Gbps bandwidth my system has access to... This creates problems both locally and remotely.
Running 5 concurrent scans on a 120ms cycle for the 50k domains should (if I calculated correctly) finish the list in ~20 minutes and shouldn't cause any issues, at least on the remote end.
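(Checking the arithmetic: 50,000 urls ÷ 5 concurrent = 10,000 cycles, and 10,000 × 120 ms = 1,200 s ≈ 20 minutes.)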
The code I'm testing with:
const request = require('request');

const urls = []; // data from mongodb

urls.forEach((url) => {
  // every iteration fires immediately, so all 50,000 requests start at once
  request(url, (err, res, body) => {
    // process the response
  });
});
The forEach function executes instantly, "queueing" all the urls and trying to fetch them all at once. It seems impossible to add a delay to each iteration. Every Google search seems to show how to rate-limit incoming requests to your own server/API, not outgoing ones. The same thing happens with a for loop: I can't control how fast the iterations execute. I'm probably missing something, or the code logic is wrong. Any suggestions?
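Even awaiting a delay inside the callback doesn't space the requests out (a minimal sketch of the attempt; the 120ms is just the cycle from above):

urls.forEach(async (url) => {
  // every callback starts its own timer in the same tick,
  // so after ~120ms all 50,000 requests still fire together
  await new Promise((resolve) => setTimeout(resolve, 120));
  request(url, (err, res, body) => {
    // process the response
  });
});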
Use async/await and Promises instead of callbacks. Try p-map or a similar approach from promise-fun. Here is a copy-pasted example:
const pMap = require('p-map');

const urls = [
  'sindresorhus.com',
  'ava.li',
  'github.com',
  …
];

console.log(urls.length);
//=> 100

const mapper = url => {
  return fetchStats(url); //=> Promise
};

pMap(urls, mapper, {concurrency: 5}).then(result => {
  console.log(result);
  //=> [{url: 'sindresorhus.com', stats: {…}}, …]
});
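To adapt this to your case (a hedged sketch, not a drop-in solution: fetchStats above is a placeholder for your own function, and the fetchUrl wrapper plus the 120ms pacing below are my assumptions), wrap the callback-style request call in a Promise and pace each of the 5 slots:

const pMap = require('p-map');
const request = require('request');

// Promisify the callback-style request module
const fetchUrl = url =>
  new Promise((resolve, reject) => {
    request(url, (err, res, body) => {
      if (err) return reject(err);
      resolve(body);
    });
  });

// Each mapper call resolves no sooner than 120ms after it starts,
// so with concurrency 5 you send at most 5 requests per 120ms cycle
const mapper = url =>
  Promise.all([
    fetchUrl(url),
    new Promise(resolve => setTimeout(resolve, 120)),
  ]).then(([body]) => body);

pMap(urls, mapper, {concurrency: 5})
  .then(results => {
    // process all 50,000 responses here
  })
  .catch(console.error);

Note that one failed request rejects the whole pMap run; catch per-url (e.g. fetchUrl(url).catch(err => ({url, err}))) if you'd rather collect errors than abort.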