I would like to scrape the following data table from a website:
<body style="background-color:grey;">
<div class="table" id="myTable" style="display: table;">
<div class="tr" style="background-color: #4CAF50; color: white;">
<div class="td tnic">Nickname</div>
<div class="td tsrv">Server IP</div>
<div class="td tip">IP</div>
<div class="td treg">Region</div>
<div class="td tcou">Country</div>
<div class="td tcit">City</div>
<div class="td tscr">Score <input type="checkbox" onchange="mysrt(this)" id="chkscr"></div>
<div class="td tupd">Update Time <input type="checkbox" onchange="mysrt(this)" id="chkupd" checked="" disabled="">
</div>
<div class="td taut">Auth Key</div>
<div class="td town">Key Owner</div>
<div class="td tver">Version</div>
<div class="td tdet">Details</div>
</div>
<div class="tr mytarget ">
<div class="td tnic">Player 1</div>
<div class="td tsrv">_GAME_MENU_</div>
<div class="td tip">x.x.226.35</div>
<div class="td treg">North America</div>
<div class="td tcou">United States</div>
<div class="td tcit">Cleveland</div>
<div class="td tscr">21</div>
<div class="td tupd">2022-12-29 10:17:01 (GMT-8)</div>
<div class="td taut">SecretauthK3y</div>
<div class="td town">CoolName</div>
<div class="td tver">7.11</div>
<div class="td tdet">FPS: 93 @ 0(0) ms @ 0 K/m</div>
</div>
<div class="tr mytarget ">
<div class="td tnic">PlayerB</div>
<div class="td tsrv">_GAME_MENU_</div>
<div class="td tip">x.x.90.221</div>
<div class="td treg">North America</div>
<div class="td tcou">United States</div>
<div class="td tcit">Mechanicsville</div>
<div class="td tscr">67991</div>
<div class="td tupd">2022-12-29 10:16:56 (GMT-8)</div>
<div class="td taut">SecretauthK3y2</div>
<div class="td town">PlayerB</div>
<div class="td tver">7.12</div>
<div class="td tdet">FPS: 50 @ 175(243) ms @ 0 K/m</div>
</div>
<div class="tr mytarget ">
<div class="td tnic">McChicken</div>
<div class="td tsrv">_GAME_MENU_</div>
<div class="td tip">x.x.39.80</div>
<div class="td treg">North America</div>
<div class="td tcou">United States</div>
<div class="td tcit"></div>
<div class="td tscr">0</div>
<div class="td tupd">2022-12-29 09:41:44 (GMT-8)</div>
<div class="td taut">SecretauthK3y3</div>
<div class="td town">SOLO KEY</div>
<div class="td tver">7.12</div>
<div class="td tdet">FPS: 63 @ 0(0) ms @ 0 K/m</div>
</div>
</div>
It has a header row under .tr, and each row of data is represented by a div with the classes .tr.mytarget. Normally there are hundreds more .tr.mytarget rows, all in the same format as the three shown. My goal is to scrape this data in a way that makes it easy to then perform some calculations and filtering on it. It will eventually be reused in a new data table.
I have a small amount of experience with JS, so my idea was to use Puppeteer. My question is twofold: what format should I scrape the data into so that it's easy to work with, and how do I write the Puppeteer statements to do this?
This is what I have so far:
import puppeteer from 'puppeteer';
(async () => {
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto('redactedurl.com');
await page.waitForSelector('#myTable');
const nicks = await page.$$eval(
'.table .tr_mytarget .td_tnic',
allNicks => allNicks.map(td_tnick => td_tnick.textContent)
);
await console.log(nicks);
await browser.close();
})();
I don't fully understand how to write the $$eval statement. I'm thinking I will want one array for the header and one for the data, but I'm not sure. What's recommended?
This looks like a pretty straightforward table traversal, if I understand correctly. The problem is a typical one: trying to do everything in a single query when it's better to use two, one for the rows and one for the cells within each row.
Here's an example:
const puppeteer = require("puppeteer"); // ^19.1.0
const html = `your HTML from above`;
let browser;
(async () => {
browser = await puppeteer.launch();
const [page] = await browser.pages();
await page.setContent(html);
const data = await page.$$eval("#myTable .tr.mytarget", rows =>
rows.map(row =>
[...row.querySelectorAll(".td")].map(cell => cell.textContent)
)
);
console.log(data);
})()
.catch(err => console.error(err))
.finally(() => browser?.close());
This gives a 2D array of the table. If you want an array of objects keyed by field, you can scrape the header row, then glue it to each row of data in the array:
// ...
const headers = await page.$$eval("#myTable .tr:first-child .td", cells =>
cells.map(e => e.textContent.trim())
);
const withHeaders = data.map(e =>
Object.fromEntries(headers.map((h, i) => [h, e[i]]))
);
console.log(withHeaders);
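Since the goal is to run calculations and filtering afterwards, the array-of-objects shape is probably the easier one to work with. Here is a minimal sketch of what that could look like, assuming data shaped like withHeaders above. The sample is hand-written from the three rows in the question and trimmed to a few columns, and the names players, onLatest, totalScore and ranked are just illustrative:

```javascript
// Hypothetical sample in the shape of `withHeaders`, trimmed to a few
// columns; the values are copied from the three rows in the question.
const withHeaders = [
  { Nickname: "Player 1", City: "Cleveland", Score: "21", Version: "7.11" },
  { Nickname: "PlayerB", City: "Mechanicsville", Score: "67991", Version: "7.12" },
  { Nickname: "McChicken", City: "", Score: "0", Version: "7.12" },
];

// textContent always yields strings, so convert numeric columns before doing math.
const players = withHeaders.map(row => ({ ...row, Score: Number(row.Score) }));

// Filtering: keep only players on version 7.12.
const onLatest = players.filter(p => p.Version === "7.12");

// Calculation: total score across all players.
const totalScore = players.reduce((sum, p) => sum + p.Score, 0);

// Sorting: highest score first (copy first so `players` is not mutated).
const ranked = [...players].sort((a, b) => b.Score - a.Score);

console.log(onLatest.map(p => p.Nickname)); // [ 'PlayerB', 'McChicken' ]
console.log(totalScore); // 68012
console.log(ranked[0].Nickname); // PlayerB
```

The main thing to watch for is that every scraped value is a string, so anything you want to compare or sum numerically needs converting first.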