I want to scrape a website that lists products, where each product has its own page with more data. I wanted to do it with an async map plus Promise.all instead of for-of, but I couldn't make it work properly.
Working sample with for-of:
const puppeteer = require("puppeteer");
const SELECTOR_ITEMS_LINKS =
  ".sg-col-4-of-12.s-result-item.sg-col-4-of-16.sg-col.sg-col-4-of-20 .a-link-normal.s-no-outline";
const removeEmptyLines = (txt) => txt.replace(/\n\n/g, "");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.amazon.com/s?k=gaming+chair");
  const links = await page.$$eval(SELECTOR_ITEMS_LINKS, (links) =>
    links.map((link) => link.href)
  );
  for (const link of links) {
    await page.goto(link);
    const rawTitle = await page.$eval("#productTitle", (el) => el.textContent);
    const title = removeEmptyLines(rawTitle);
    console.log({ link, title });
  }
  await browser.close();
})();
Result:
{
link: 'https://www.amazon.com/AJS-Clearance-Computer-Armrests-Adjustment/dp/B08QHZX2M9/ref=sr_1_26?dchild=1&keywords=gaming+chair&qid=1616435089&sr=8-26',
title: 'AJS Office Chairs Clearance, Cheap Gaming Chair for Teens, Fabric Computer Desk Chair with Padded Armrests and Height Adjustment (Red)\n'
}
{
link: 'https://www.amazon.com/Swivel-Gaming-Support-Adjustable-Lounger/dp/B089D2DDNT/ref=sr_1_27?dchild=1&keywords=gaming+chair&qid=1616435089&sr=8-27',
title: 'Swivel Gaming Floor Chair with Arms Back Support Adjustable Floor Sofa for Adults Teens Lazy Sofa Lounger Video Game Chair, Black and Blue\n'
}
{
link: 'https://www.amazon.com/Nokaxus-Retractible-adjustment-Thickening-YK-6008-BLACK/dp/B07DZKG7SN/ref=sr_1_28?dchild=1&keywords=gaming+chair&qid=1616435089&sr=8-28',
title: 'Nokaxus Gaming Chair Large Size High-back Ergonomic Racing Seat with Massager Lumbar Support and Retractible Footrest PU Leather 90-180 degree adjustment of backrest Thickening sponges (YK-6008-BLACK)\n'
}
Now I want to do the same thing using map instead of for-of. Sample:
const puppeteer = require("puppeteer");
const SELECTOR_ITEMS_LINKS =
  ".sg-col-4-of-12.s-result-item.sg-col-4-of-16.sg-col.sg-col-4-of-20 .a-link-normal.s-no-outline";
const removeEmptyLines = (txt) => txt.replace(/\n\n/g, "");
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://www.amazon.com/s?k=gaming+chair");
  const links = await page.$$eval(SELECTOR_ITEMS_LINKS, (links) =>
    links.map((link) => link.href)
  );
  const resolver = async (link) => {
    await page.goto(link);
    const rawTitle = await page.$eval("#productTitle", (el) => el.textContent);
    const title = removeEmptyLines(rawTitle);
    return { link, title };
  };
  const promises = await links.map((link) => resolver(link));
  const result = await Promise.all(promises);
  console.log(result);
  browser.close();
})();
What I get is the same title for every link, as if it's ignoring the other links. Result:
{
link: 'https://www.amazon.com/OSP-Furniture-Ergonomic-Adjustable-Accents/dp/B08PDS88PZ/ref=sr_1_58?dchild=1&keywords=gaming+chair&qid=1616435001&sr=8-58',
title: 'Soontrans Rocking Gaming Chair,Ergonomic PC Computer Chair,Home Office Chair,Racing Chair with Adjustable Recliner and Armrest with Headrest Lumbar Pillow Support (Green)\n'
},
{
link: 'https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_aps_sr_pg1_1?ie=UTF8&adId=A05234183UPC2EOCKELXB&url=%2FNOUHAUS-Palette-Ergonomic-Comfortable-Computer%2Fdp%2FB083SN6BVS%2Fref%3Dsr_1_59_sspa%3Fdchild%3D1%26keywords%3Dgaming%2Bchair%26qid%3D1616435001%26sr%3D8-59-spons%26psc%3D1%26smid%3DA1DPRB9NBV0XDD&qualifier=1616435001&id=7849373560319144&widgetName=sp_btf',
title: 'Soontrans Rocking Gaming Chair,Ergonomic PC Computer Chair,Home Office Chair,Racing Chair with Adjustable Recliner and Armrest with Headrest Lumbar Pillow Support (Green)\n'
},
{
link: 'https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_aps_sr_pg1_1?ie=UTF8&adId=A034517018WLNV9ZSL6AD&url=%2FSoontrans-Ergonomic-Computer-Adjustable-Recliner%2Fdp%2FB08HWPJZP2%2Fref%3Dsr_1_60_sspa%3Fdchild%3D1%26keywords%3Dgaming%2Bchair%26qid%3D1616435001%26sr%3D8-60-spons%26psc%3D1&qualifier=1616435001&id=7849373560319144&widgetName=sp_btf',
title: 'Soontrans Rocking Gaming Chair,Ergonomic PC Computer Chair,Home Office Chair,Racing Chair with Adjustable Recliner and Armrest with Headrest Lumbar Pillow Support (Green)\n'
}
Do you know how to accomplish the same result with map?
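The problem is that your resolvers run in (pseudo) parallel while all using the same page, so they step on each other: every page.goto() navigates the same tab, and each $eval ends up reading whichever product page happens to be loaded at that moment. You can fix it by creating a new page on each call: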
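const resolver = async (link) => {
  // Open a dedicated tab so concurrent calls don't share navigation state.
  const newPage = await browser.newPage();
  await newPage.goto(link);
  // Query the new tab, not the shared `page` from the outer scope.
  const rawTitle = await newPage.$eval("#productTitle", (el) => el.textContent);
  const title = removeEmptyLines(rawTitle);
  await newPage.close();
  return { link, title };
};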
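If you want to see the effect without hitting a real site, here is a minimal sketch that simulates it. FakePage, sleep, the delays and the "a", "b", "c" links are all hypothetical stand-ins I made up for illustration; they are not part of Puppeteer and only mimic the fact that a page holds one current URL at a time.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for a browser page (NOT the Puppeteer API):
// it remembers a single current URL, like a real tab does.
class FakePage {
  constructor() {
    this.url = null;
  }
  async goto(url) {
    await sleep(10); // simulated network latency
    this.url = url;
  }
  async title() {
    await sleep(10); // simulated DOM query
    return `title of ${this.url}`;
  }
}

const links = ["a", "b", "c"];

// All callbacks navigate the same page, so by the time any title is read
// the page holds the URL of the last navigation to finish.
async function scrapeWithSharedPage() {
  const page = new FakePage();
  return Promise.all(
    links.map(async (link) => {
      await page.goto(link);
      return { link, title: await page.title() };
    })
  );
}

// Each callback gets its own page, so navigations cannot overwrite each other.
async function scrapeWithPagePerLink() {
  return Promise.all(
    links.map(async (link) => {
      const page = new FakePage();
      await page.goto(link);
      return { link, title: await page.title() };
    })
  );
}

(async () => {
  console.log("shared page:", await scrapeWithSharedPage());
  console.log("page per link:", await scrapeWithPagePerLink());
})();
```

In this simulation the shared-page version pairs every link with the same title, which is the pattern in your output, while the page-per-link version keeps each title with its own link. Keep in mind that with real Puppeteer this approach opens one tab per link at the same time, so for long lists you may want to process the links in smaller groups.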