node.js · web-scraping · playwright · crawlee

An exception is thrown when the "load more" button is detached from the DOM, and I'm not able to get one of the API responses


I'm using Crawlee (JS) with a headless browser provided by Playwright to capture the content of a page that uses an infinite scroll technique.

I need to get the data of a website, and the website loads this data using infinite scroll.

To achieve this I'm intercepting the HTTP requests that load the data, but I can't get one of the pages of data.

The "load more" button that loads more data into the infinite scroll container is detached from the DOM once the last page is reached (it's no longer needed because there is no more content to load). When the button is removed, an exception is thrown, and I suspect that exception is the reason I can't get one of the API responses (the last one).

For managing the infinite scroll, I'm using the infiniteScroll utility function.

For getting the API responses, I'm listening to the requestfinished event.

The exception I'm getting is:

elementHandle.click: Element is not attached to the DOM

My current code is

import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const productItemScrollPage = {
    domain: 'https://www.chedraui.com.mx',
    url: "https://www.chedraui.com.mx/mascotas/perros/alimento",
    jsonResponseProductListPropertyPath: "data.productSearch.products",
    jsonRequestProductPageUrl: "https://www.chedraui.com.mx/_v/segment/graphql/v1", 
    urlAttributesToMatch: ["operationName=productSearchV3"],
    hasLoadMoreBtn: false,
    loadMoreBtnSelector: ""
}

const productItemScrollPageV2 = {
    domain: 'https://www.heb.com.mx',
    url: "https://www.heb.com.mx/mascotas/perros/alimento-seco",
    jsonResponseProductListPropertyPath: "data.productSearch.products",
    jsonRequestProductPageUrl: "https://www.heb.com.mx/_v/segment/graphql/v1", 
    urlAttributesToMatch: ["operationName=productSearchV3"],
    hasLoadMoreBtn: true,
    loadMoreBtnSelector: ".vtex-search-result-3-x-buttonShowMore button"
}


function areAllAttributesPresentInTheURLRequest(url, productItemScrollPage) {
    // True only when every expected attribute substring appears in the URL.
    return productItemScrollPage.urlAttributesToMatch.every((value) => url.includes(value));
}
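For illustration, the matcher can be exercised on its own; the URLs below are just sample strings, and the function is repeated here so the snippet runs standalone:

```javascript
// Returns true only when every expected attribute substring appears in the URL.
function areAllAttributesPresentInTheURLRequest(url, productItemScrollPage) {
    return productItemScrollPage.urlAttributesToMatch.every((value) => url.includes(value));
}

const config = { urlAttributesToMatch: ["operationName=productSearchV3"] };

console.log(areAllAttributesPresentInTheURLRequest(
    "https://www.heb.com.mx/_v/segment/graphql/v1?operationName=productSearchV3",
    config
)); // true

console.log(areAllAttributesPresentInTheURLRequest(
    "https://www.heb.com.mx/_v/segment/graphql/v1?operationName=somethingElse",
    config
)); // false
```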

const crawler = new PlaywrightCrawler({
    headless: false,
    maxRequestsPerCrawl: 20,
    maxRequestRetries: 10,
    retryOnBlocked: true,
    requestHandler: async ({ page, request, enqueueLinks }) => {        
        page.on('requestfinished', async (request) => {
            if(request.url().startsWith(productItemScrollPageV2.jsonRequestProductPageUrl) && 
                areAllAttributesPresentInTheURLRequest(request.url(),productItemScrollPageV2)
            ){

                const res = await request.response()
                const jsonObj = await res.json()
            }
        });


        await playwrightUtils.infiniteScroll(page,{
            buttonSelector: productItemScrollPageV2.hasLoadMoreBtn ? productItemScrollPageV2.loadMoreBtnSelector : "",
            waitForSecs: 30
        })

    },
});
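As a side note, the detachment error from the load-more click can be suppressed without hiding other failures (the solution below mentions wrapping the code in try/catch). A minimal sketch, assuming the error message matches Playwright's "Element is not attached to the DOM"; the `ignoreDetachedError` wrapper is hypothetical, not part of Crawlee:

```javascript
// Hypothetical wrapper: runs an async action and swallows only the
// "Element is not attached to the DOM" error, re-throwing anything else.
async function ignoreDetachedError(action) {
    try {
        await action();
    } catch (e) {
        if (!String(e.message).includes('Element is not attached to the DOM')) {
            throw e;
        }
    }
}
```

Inside the requestHandler it would be used like `await ignoreDetachedError(() => playwrightUtils.infiniteScroll(page, {...}))`.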

Solution

  • The problem was not the exception being thrown (that's easy to solve by wrapping the code in a try { } catch (e) { } block). It was not the last page of results that was missing, but the first.

    The first request is not caught because it isn't fired by UI events; it fires as soon as the page's JavaScript starts executing, so the requestfinished event handler needs to be added before the navigation to the page happens.

    I solved this problem by adding the preNavigationHooks property to the PlaywrightCrawler config options.

    The preNavigationHooks property is an array of async functions that are executed before the navigation to a specific page happens. Inside those functions you can configure the page object before it navigates to a URL.

    const crawler = new PlaywrightCrawler({
        preNavigationHooks: [
            async (crawlingContext, gotoOptions) => {
                const { page } = crawlingContext;
                page.on('requestfinished', async (request) => {
                    if(request.url().startsWith(productItemScrollPageV2.jsonRequestProductPageUrl) && 
                        areAllAttributesPresentInTheURLRequest(request.url(),productItemScrollPageV2)
                    ){
                        const res = await request.response()
                        const jsonObj = await res.json()
                        // productCounter and resolvePath are defined elsewhere in the script
                        productCounter += resolvePath(jsonObj, productItemScrollPageV2.jsonResponseProductListPropertyPath).length
                    }
                });
                
            }
        ],
        headless: false,
        maxRequestsPerCrawl: 20,
        maxRequestRetries: 10,
        retryOnBlocked: true,
        requestHandler: async ({ page, request, enqueueLinks }) => {            
            await playwrightUtils.infiniteScroll(page,{
                buttonSelector: productItemScrollPageV2.hasLoadMoreBtn ? productItemScrollPageV2.loadMoreBtnSelector : "",
                waitForSecs: 30
            })
    
        },
    });
    

    By adding the requestfinished event handler in a pre-navigation hook, I ensure the handler is registered before any JavaScript is executed in the web page. This way I can catch the requests that are fired not by UI events but by a page script.
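The `resolvePath` helper used above is not shown in the original code. A minimal sketch of what it could look like, assuming it simply walks a dot-separated property path (e.g. "data.productSearch.products") through a nested object:

```javascript
// Hypothetical resolvePath: follows a dot-separated path through a nested
// object, returning undefined if any intermediate segment is missing.
function resolvePath(obj, path) {
    return path.split('.').reduce(
        (current, key) => (current == null ? undefined : current[key]),
        obj
    );
}

const sample = { data: { productSearch: { products: [{ id: 1 }, { id: 2 }] } } };
console.log(resolvePath(sample, "data.productSearch.products").length); // 2
console.log(resolvePath(sample, "data.missing.products")); // undefined
```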