I'm using Node's Puppeteer library to scrape a website's table.
The scrape returns two sets of rows: one contains all rows and columns of the table, and the second contains only the table's first column. I don't know why this happens and haven't been able to fix it.
This is the code I'm using to scrape the table:
var result = await page.$$eval('tbody > tr', rows => {
  return Array.from(rows, row => {
    const columns = row.querySelectorAll('td');
    const arr = Array.from(columns, column => column.innerText);
    if (arr.length <= 1) {
      return;
    }
    return arr;
  });
});
As you can see, I'm trying to filter the second table's rows out of the resulting array. However, I assume that since the arrow function is already doing its thing as the mapping callback, a bare return simply puts a null value into the array. I don't want that, because the array then has twice as many entries as it should.
I can filter out the null values using this code:
var filtered = result.filter(function (el) {
  return el != null;
});
But that iterates the array a second time just to remove the null values, which doubles the time my routine takes to execute.
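To make the problem concrete, here is a minimal sketch that runs in plain Node without Puppeteer. The sample data is made up: each inner array stands in for the td texts of one row, and the one-element arrays play the role of the unwanted second table.

```javascript
// Hypothetical sample data: each inner array stands in for the <td> texts of one <tr>.
const sampleRows = [
  ['Alice', '30', 'Lisbon'],
  ['Alice'],
  ['Bob', '25', 'Porto'],
  ['Bob'],
];

// Same shape as the $$eval callback: a bare `return` still produces an entry.
const mapped = Array.from(sampleRows, cells => {
  if (cells.length <= 1) {
    return; // undefined here; this is the entry I see as null in the $$eval result
  }
  return cells;
});

// Second pass that drops the empty entries (`!= null` matches null and undefined).
const filteredSample = mapped.filter(el => el != null);

console.log(mapped.length);         // 4
console.log(filteredSample.length); // 2
```

The mapped array keeps all four entries, and only the extra filter pass brings it down to two.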
My question: how can I filter out these rows with a column count equal to or less than 1?
Edit: Although I accepted James' answer, I should mention that the real fix to my problem was pointed out by Barmar in the comment section: I should have evaluated the page using the selector table#filter--result-table-resumo > tbody > tr, which excludes the second, unwanted table.
Here is the final code:
var result = await page.$$eval('table#filter--result-table-resumo > tbody > tr', rows => {
  return Array.from(rows, row => {
    const columns = row.querySelectorAll('td');
    return Array.from(columns, column => column.innerText);
  });
});
Array.from always returns an array with one entry per input element, so it isn't meant for filtering out rows. For that you need array.filter:
Array.from(rows).filter(row => row.querySelectorAll('td').length > 1);
As @Barmar points out, I missed your mapping. Since you've said efficiency is important, rather than chaining .filter and .map you can combine them using .reduce to do both operations in one pass:
Array.from(rows).reduce((acc, row) => {
  const columns = row.querySelectorAll('td');
  if (columns.length > 1) {
    acc.push(Array.from(columns, column => column.innerText));
  }
  return acc;
}, []);
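If you want to try the pattern outside the browser, here is a rough sketch of the same reduce on plain arrays. The helper name keepMultiColumnRows and the sample data are made up for illustration, and trim() merely stands in for reading column.innerText.

```javascript
// Hypothetical helper: `rows` is an array of string arrays standing in for table rows.
function keepMultiColumnRows(rows) {
  return rows.reduce((acc, cells) => {
    // Filter step: skip rows with one cell or fewer.
    if (cells.length > 1) {
      // Map step: transform the cells (trim() is a stand-in for column.innerText).
      acc.push(cells.map(text => text.trim()));
    }
    return acc;
  }, []);
}

// Made-up sample: the one-element rows mimic the unwanted second table.
const scraped = [
  [' Alice ', '30'],
  ['Alice'],
  [' Bob ', '25'],
  ['Bob'],
];

console.log(keepMultiColumnRows(scraped));
// [ [ 'Alice', '30' ], [ 'Bob', '25' ] ]
```

The accumulator only grows when a row passes the check, so no null placeholders are created and no second pass is needed.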