javascript node.js arrays puppeteer arrow-functions

Filter an array during an arrow function execution


I'm using Node's Puppeteer library to scrape a website's table.

During scraping, I get two arrays: one containing all rows and columns of the table, and a second containing only the table's first column. I don't know why this happens and haven't been able to fix it.

This is the code I'm using to scrape the table:

var result = await page.$$eval('tbody > tr', rows => {
    return Array.from(rows, row => {
        const columns = row.querySelectorAll('td');
        const arr = Array.from(columns, column => column.innerText);
        if (arr.length <= 1) {
            return;
        }
        return arr;
    });
});

As you can see, I'm trying to filter this second table out of the resulting array. However, I assume that because the arrow function has to produce a value for every row, a bare return simply leaves a null value in the array. I don't want that, since the array ends up with twice as many entries as it should.
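
To illustrate the behaviour described above without a browser, here is a minimal sketch that uses a plain array as a hypothetical stand-in for the scraped rows. It assumes that the result of page.$$eval is serialized in a JSON-like way on its way back to Node, which would explain why the skipped rows show up as null rather than undefined; the JSON round trip below only mimics that step.

```javascript
// Hypothetical stand-in for the scraped rows: each inner array is one row's cells.
const rows = [['a', 'b'], ['a'], ['c', 'd'], ['c']];

// The mapping callback runs once per row and always fills one slot,
// so a bare return stores undefined instead of skipping the row.
const mapped = Array.from(rows, row => {
    if (row.length <= 1) {
        return;
    }
    return row;
});

// Assumption: a JSON round trip approximates how the value is serialized
// when it is passed back from the page. JSON has no undefined, so it becomes null.
const serialized = JSON.parse(JSON.stringify(mapped));

console.log(mapped.length);
console.log(serialized);
```

The output array keeps one slot per input row, so the unwanted rows have to be removed in a separate step or never mapped in the first place.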

I can filter out the null values using this code:

var filtered = result.filter(function (el) {
    return el != null;
});

But in doing so, I'm iterating over the array a second time just to filter out the null values, which doubles the number of passes my routine makes over the data.

My question here is: How to filter out these rows with a column count equal to or less than 1?

Edit: Although I accepted James' answer, I should mention that the real fix to my problem was pointed out by Barmar in the comment section. I should have evaluated the page using table#filter--result-table-resumo > tbody > tr, which excludes the second, unwanted table.

Here is the final code:

var result = await page.$$eval('table#filter--result-table-resumo > tbody > tr', rows => {
    return Array.from(rows, row => {
        const columns = row.querySelectorAll('td');
        return Array.from(columns, column => column.innerText);
    });
});

Solution

  • Array.from always returns one element per item in its source, so the result has a fixed length. It's not meant for filtering out rows; for that you need array.filter:

    Array.from(rows).filter(row => row.querySelectorAll('td').length > 1);
    

    As @Barmar points out, I missed your mapping. Rather than using .filter and .map (you've pointed out that efficiency is important), you can combine them with a .reduce that does both operations in one step:

    Array.from(rows).reduce((acc, row) => {
        const columns = row.querySelectorAll('td');
        if (columns.length > 1) {
            acc.push(Array.from(columns, column => column.innerText));
        }
        return acc;
    }, []);
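
To compare the approaches above outside the browser, here is a minimal sketch on plain arrays. The rows data and the trim() call are hypothetical stand-ins for the DOM rows and for reading innerText. The flatMap variant is not part of the original answer; it is shown only as another way to filter and map in one pass.

```javascript
// Hypothetical stand-in for the DOM rows: each inner array holds one row's cell texts.
const rows = [
    [' Alice ', '30'],
    ['Alice'],
    [' Bob ', '25'],
    ['Bob'],
];

// Two passes: drop the short rows first, then map the rest.
const twoPass = rows
    .filter(cells => cells.length > 1)
    .map(cells => cells.map(text => text.trim()));

// One pass with reduce: push only the rows that qualify.
const onePass = rows.reduce((acc, cells) => {
    if (cells.length > 1) {
        acc.push(cells.map(text => text.trim()));
    }
    return acc;
}, []);

// One pass with flatMap: return [] to drop a row, [value] to keep it.
const viaFlatMap = rows.flatMap(cells =>
    cells.length > 1 ? [cells.map(text => text.trim())] : []
);

console.log(onePass);
```

All three produce the same rows. Inside page.$$eval the same patterns should apply after Array.from(rows), with row.querySelectorAll('td') in place of the inner arrays.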