I've been working on a puppeteer app to scrape some data.
The code below works, but I'd like to improve it so that the data comes back in a structured form I can work with.
const table1 = await page.$$eval('table:nth-child(3) tbody', tbodys => tbodys.map((tbody) => {
  return tbody.innerText;
}));
Selecting tbody lets me scrape every TR and TD no matter how many the table contains. My problem is that each table row holds two table cells: the first TD is the header for the data in the second TD.
So I have the following HTML:
<tr class="header1"><th colspan="2">COS-MOD-000-CAB-PAP-123202</th></tr>
body > center > table > tbody > tr:nth-child(2) > td:nth-child(2) > div:nth-child(3) > table:nth-child(3) > tbody > tr:nth-child(2)
// This is the tbody my original code pulls the text from. My code only looks at the TDs within the TRs.
<tbody><tr class="header1"><th colspan="2">COS-MOD-000-CAB-PAP-123202</th></tr>
<tr class="light">
<td style="text-align: right; width: 100px;"><strong>Status:</strong></td>//HEADER
<td valign="top">Wrong </td> //VALUE
</tr>
<tr class="dark">
<td style="text-align: right; width: 100px;"><strong>Created:</strong></td>//HEADER
<td valign="top">2019-09-09 17:18:53 </td>//VALUE
</tr>
<tr class="light">
<td style="text-align: right; width: 100px;"><strong>Modified:</strong></td>//HEADER
<td valign="top">2019-09-09 17:21:19 </td>//VALUE
</tr>
<tr class="dark">
<td style="text-align: right; width: 100px;"><strong>User:</strong></td>//HEADER
<td valign="top">fbibsan </td>//VALUE
</tr>
<tr class="light">
<td style="text-align: right; width: 100px;"><strong>BMS Account:</strong></td> //HEADER
<td valign="top">ABC123 SAS. (SAS) </td> //VALUE
</tr>
<tr class="dark">
<td style="text-align: right; width: 100px;"><strong>Mode:</strong></td>//HEADER
<td valign="top">FAF </td>//VALUE
</tr>
<tr class="light">
<td style="text-align: right; width: 100px;"><strong>Type:</strong></td>
<td valign="top">BOP </td>
</tr>
</tbody>
The structure I need is for each row in the table:
HEADER:'VALUE'
I hope someone could help. I'd be very grateful as I've spent days searching now.
If I understand the task correctly, here is a simplified example of how to get structured data from a table:
const html = `
<!doctype html>
<html>
<head><meta charset='UTF-8'><title>Test</title></head>
<body>
<table><tbody>
<tr><th>Header</th><th>Header</th></tr>
<tr><td>Key 1</td><td>Value 1</td></tr>
<tr><td>Key 2</td><td>Value 2</td></tr>
</tbody></table>
</body>
</html>`;
const puppeteer = require('puppeteer');
(async function main() {
  try {
    const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
    const [page] = await browser.pages();
    await page.goto(`data:text/html,${html}`);
    const data = await page.evaluate(() => {
      const dataObject = {};
      const tbody = document.querySelector('table tbody');
      for (const row of tbody.rows) {
        if (!row.querySelector('td')) continue; // Skip headers.
        const [keyCell, valueCell] = row.cells;
        dataObject[keyCell.innerText] = valueCell.innerText;
      }
      return dataObject;
    });
    console.log(data); // { 'Key 1': 'Value 1', 'Key 2': 'Value 2' }
    // await browser.close();
  } catch (err) {
    console.error(err);
  }
})();
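With the table from the question, the header cells end in a colon ("Status:") and the value cells carry trailing whitespace ("Wrong "), so the keys and values may need a little cleanup. Below is a minimal sketch of that step in plain Node, with no browser needed. The helper name pairsToObject and the sample pairs are assumptions made for illustration; the pairs stand in for the [header, value] texts that the page would return.

```javascript
// Hypothetical helper (not part of Puppeteer): converts an array of
// [header, value] text pairs into a plain object.
function pairsToObject(pairs) {
  const result = {};
  for (const [rawKey, rawValue] of pairs) {
    // Trim whitespace and drop a single trailing colon from the header text.
    const key = rawKey.trim().replace(/:$/, '');
    result[key] = rawValue.trim();
  }
  return result;
}

// Sample pairs copied from the HTML in the question.
const samplePairs = [
  ['Status:', 'Wrong '],
  ['Created:', '2019-09-09 17:18:53 '],
  ['BMS Account:', 'ABC123 SAS. (SAS) '],
];

const record = pairsToObject(samplePairs);
console.log(record);
```

In the Puppeteer script, the pairs could be collected inside page.evaluate by pushing [keyCell.innerText, valueCell.innerText] into an array instead of writing to dataObject directly, and then passed to the helper on the Node side.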