Tags: javascript, web-scraping, javascript-objects, puppeteer

Puppeteer - scrape data from table in correct format


I've been working on a puppeteer app to scrape some data.

I've got this code, which works, but I want to improve it so that it returns the data in a structured form I can work with.

const table1 = await page.$$eval('table:nth-child(3) tbody', tbodys => tbodys.map((tbody) => {
  return tbody.innerText;
}));

Selecting the tbody lets me scrape all the TR and TD tags, no matter how many the table contains. The problem is that each table row holds two table cells: the first TD is the header for the data in the second TD.

So I have the following HTML:

<tr class="header1"><th colspan="2">COS-MOD-000-CAB-PAP-123202</th></tr>

body > center > table > tbody > tr:nth-child(2) > td:nth-child(2) > div:nth-child(3) > table:nth-child(3) > tbody > tr:nth-child(2)

// This is the tbody whose text my original code pulls out. My code only looks at TDs within TRs.
<tbody><tr class="header1"><th colspan="2">COS-MOD-000-CAB-PAP-123202</th></tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>Status:</strong></td>//HEADER
    <td valign="top">Wrong&nbsp;</td> //VALUE
</tr>
<tr class="dark">
    <td style="text-align: right; width: 100px;"><strong>Created:</strong></td>//HEADER
    <td valign="top">2019-09-09 17:18:53&nbsp;</td>//VALUE
</tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>Modified:</strong></td>//HEADER
    <td valign="top">2019-09-09 17:21:19&nbsp;</td>//VALUE
</tr>
<tr class="dark">
    <td style="text-align: right; width: 100px;"><strong>User:</strong></td>//HEADER
    <td valign="top">fbibsan&nbsp;</td>//VALUE
</tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>BMS Account:</strong></td> //HEADER
    <td valign="top">ABC123 SAS. (SAS)&nbsp;</td> //VALUE
</tr>
<tr class="dark">
    <td style="text-align: right; width: 100px;"><strong>Mode:</strong></td>//HEADER
    <td valign="top">FAF&nbsp;</td>//VALUE
</tr>
<tr class="light">
    <td style="text-align: right; width: 100px;"><strong>Type:</strong></td>
    <td valign="top">BOP&nbsp;</td>
</tr>
</tbody>

The structure I need is for each row in the table:

HEADER:'VALUE'

I hope someone could help. I'd be very grateful as I've spent days searching now.


Solution

  • If I understand the task correctly, here is a simplified example of how to get structured data from a table:

    const html = `
      <!doctype html>
      <html>
        <head><meta charset='UTF-8'><title>Test</title></head>
        <body>
          <table><tbody>
            <tr><th>Header</th><th>Header</th></tr>
            <tr><td>Key 1</td><td>Value 1</td></tr>
            <tr><td>Key 2</td><td>Value 2</td></tr>
          </tbody></table>
        </body>
      </html>`;
    
    const puppeteer = require('puppeteer');
    
    (async function main() {
      try {
        const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
        const [page] = await browser.pages();
    
        await page.goto(`data:text/html,${html}`);
    
        const data = await page.evaluate(() => {
          const dataObject = {};
          const tbody = document.querySelector('table tbody');
    
          for (const row of tbody.rows) {
            if (!row.querySelector('td')) continue; // Skip headers.
    
            const [keyCell, valueCell] = row.cells;
            dataObject[keyCell.innerText] = valueCell.innerText;
          }
          return dataObject;
        });
    
        console.log(data); // { 'Key 1': 'Value 1', 'Key 2': 'Value 2' }
    
        // await browser.close();
      } catch (err) {
        console.error(err);
      }
    })();
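
The table in the question differs from the simplified one in two ways: the header cells end with a colon (`Status:`) and the value cells end with a non-breaking space (`&nbsp;`). The sketch below is one possible way to handle that, not part of the original answer. `rowsToObject` and `scrapeTable` are hypothetical helper names, and the `> tbody > tr` selector is an assumption about the page structure, chosen so that rows of nested tables are not picked up.

```javascript
// Hypothetical helper: turns [header, value] text pairs into a plain object.
// trim() also removes the non-breaking space that &nbsp; leaves in innerText,
// and the regex drops a single trailing colon from the header text.
function rowsToObject(rows) {
  const result = {};
  for (const [rawKey = '', rawValue = ''] of rows) {
    const key = rawKey.trim().replace(/:$/, '');
    if (key) result[key] = rawValue.trim();
  }
  return result;
}

// Hypothetical wrapper: `page` is a Puppeteer Page, `tableSelector` is a CSS
// selector for the target table (for example 'table:nth-child(3)').
async function scrapeTable(page, tableSelector) {
  const rows = await page.$$eval(`${tableSelector} > tbody > tr`, (trs) =>
    trs
      .filter((tr) => tr.querySelector('td')) // skip rows that only contain <th>
      .map((tr) => Array.from(tr.cells, (cell) => cell.innerText))
  );
  return rowsToObject(rows);
}

// The cleaning step on its own, using two rows from the question's table:
console.log(rowsToObject([
  ['Status:', 'Wrong\u00a0'],
  ['Created:', '2019-09-09 17:18:53\u00a0'],
]));
// { Status: 'Wrong', Created: '2019-09-09 17:18:53' }
```

Keeping the cleaning in a plain function outside `page.$$eval` means it can be checked without launching a browser; only the cell text has to cross from the page into Node.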