I'm trying to scrape https://www.baseball-reference.com/players/p/pujolal01.shtml for player stats, specifically the rows of the Standard Batting and Player Value--Batting tables. Here's part of my code:
const page = cheerio.load(response.data);
const statsTable = page('#batting_standard');
const rows = statsTable.find('tbody > tr').not('.minors_table').add(statsTable.find('tfoot > tr:first'));
const moreStatsTable = page('#batting_value');
const moreRows = moreStatsTable.find('tbody > tr, tfoot > tr:first');
For some reason, it's able to retrieve the first table (id = 'batting_standard'), but not the second (id = 'batting_value'), such that moreStatsTable = null. What's going on? I don't understand why cheerio can't find the value table, since it has a unique id. Is it just me having this issue?
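Expanding on chitown88's comment, the data you want appears to be inside HTML comments. The site uses JS after the page loads to display the HTML from these comments. Cheerio only parses the static HTML and doesn't run that JS, so the table is still a comment node when your selector runs and `#batting_value` matches nothing.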
There's a useful Cheerio GitHub issue #423 which has a method of identifying and extracting data from comments. I adapted this to your use case to find the particular table you want:
const cheerio = require("cheerio"); // 1.0.0-rc.12

const url = "https://www.baseball-reference.com/players/p/pujolal01.shtml";

fetch(url) // Node 18 or install node-fetch, or use another library like axios
  .then(res => {
    if (!res.ok) {
      throw Error(res.statusText);
    }

    return res.text();
  })
  .then(html => {
    const $ = cheerio.load(html);
    $("*").map((i, el) => {
      $(el).contents().map((i, el) => {
        if (el.type === "comment") {
          const $ = cheerio.load(el.data);
          const table = $("#batting_value").first();

          if (table.length) {
            const data = [...table.find("tr")].map(e =>
              [...$(e).find("td, th")].map(e => $(e).text().trim())
            );
            // trim the table a bit for display
            console.table(data.slice(0, 4).map(e => e.slice(0, 4)));
          }
        }
      });
    });
  });
Output:
┌─────────┬────────┬───────┬───────┬──────┐
│ (index) │ 0 │ 1 │ 2 │ 3 │
├─────────┼────────┼───────┼───────┼──────┤
│ 0 │ 'Year' │ 'Age' │ 'Tm' │ 'Lg' │
│ 1 │ '2001' │ '21' │ 'STL' │ 'NL' │
│ 2 │ '2002' │ '22' │ 'STL' │ 'NL' │
│ 3 │ '2003' │ '23' │ 'STL' │ 'NL' │
└─────────┴────────┴───────┴───────┴──────┘
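
If walking every node in the document feels heavy, another option is to pull the comment bodies out of the raw HTML string first and only parse the one that mentions the table id. This is a minimal sketch rather than a tested scraper: it assumes the comments are well-formed `<!-- ... -->` blocks, `extractCommentContaining` is a hypothetical helper name (not part of Cheerio), and the sample markup is made up for illustration.

```javascript
// Hypothetical helper (not part of Cheerio): returns the body of the first
// HTML comment in `html` that contains `needle`, or null if none does.
// Assumes well-formed comments of the form <!-- ... -->.
function extractCommentContaining(html, needle) {
  const commentPattern = /<!--([\s\S]*?)-->/g;
  for (const match of html.matchAll(commentPattern)) {
    if (match[1].includes(needle)) {
      return match[1];
    }
  }
  return null;
}

// Small made-up page for illustration; the real page is much larger.
const sampleHtml = `
<div id="wrapper">
  <!--
  <table id="batting_value"><tbody><tr><th>2001</th><td>21</td></tr></tbody></table>
  -->
</div>`;

const fragment = extractCommentContaining(sampleHtml, 'id="batting_value"');
console.log(fragment !== null); // true
```

The returned string can then be handed to `cheerio.load(fragment)` and queried with `$("#batting_value")` in the same way as in the code above.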