Tags: javascript, node.js, web-scraping, cheerio

Cheerio cannot find table by id


I'm trying to scrape https://www.baseball-reference.com/players/p/pujolal01.shtml for player stats, specifically the Standard Batting and Player Value--Batting tables. Here's part of my code:

const page = cheerio.load(response.data);
const statsTable = page('#batting_standard');
const rows = statsTable.find('tbody > tr').not('.minors_table').add(statsTable.find('tfoot > tr:first'));
const moreStatsTable = page('#batting_value');
const moreRows = moreStatsTable.find('tbody > tr, tfoot > tr:first');

For some reason, it retrieves the first table (id = 'batting_standard') but not the second (id = 'batting_value'), so moreStatsTable comes back empty (no matching elements). What's going on? I don't understand why cheerio can't find the value table, since it has a unique id. Is it just me having this issue?


Solution

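  • Expanding on chitown88's comment, the data you want appears to be inside HTML comments in the page source. The site uses JavaScript after the page loads to display the HTML from these comments. Cheerio only parses the static source and does not run scripts, so the table never exists as an element for a selector to match.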
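
    As a rough illustration of what that client-side step amounts to, here is a minimal sketch that strips the comment delimiters so the hidden markup becomes ordinary markup. The helper name unwrapComments and the sample HTML are made up for this example; the real page is far larger and its exact structure is not shown here.

    ```javascript
    // Hypothetical helper (not part of Cheerio): drop the <!-- and -->
    // delimiters so markup hidden inside HTML comments becomes ordinary markup.
    function unwrapComments(html) {
      return html.replace(/<!--|-->/g, "");
    }

    // Tiny stand-in for the real page: the table exists only inside a comment.
    const sample = [
      '<div id="wrapper">',
      "<!--",
      '<table id="batting_value"><tbody><tr><th>2001</th><td>21</td></tr></tbody></table>',
      "-->",
      "</div>",
    ].join("\n");

    const unwrapped = unwrapComments(sample);
    console.log(unwrapped.includes("<!--")); // false
    console.log(unwrapped.includes('<table id="batting_value">')); // true
    ```

    Passing unwrapComments(response.data) to cheerio.load should then let a plain '#batting_value' selector match. Note that this unwraps every comment on the page, including ones that were never meant to be rendered, so extracting only the comment you need is the more surgical option.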

    Cheerio GitHub issue #423 describes a way to identify and extract data from comments. I adapted it to your use case to find the particular table you want:

    const cheerio = require("cheerio"); // 1.0.0-rc.12
    
    const url = "https://www.baseball-reference.com/players/p/pujolal01.shtml";
    
    fetch(url) // Node 18 or install node-fetch, or use another library like axios
      .then(res => {
        if (!res.ok) {
          throw Error(res.statusText);
        }
    
        return res.text();
      })
      .then(html => {
        const $ = cheerio.load(html);
    
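        // The hidden tables live in comment nodes, so walk every element's
        // child nodes and look for comments.
        $("*").each((i, el) => {
          $(el).contents().each((j, node) => {
            if (node.type !== "comment") {
              return;
            }
    
            // Parse the comment body as its own document.
            const $comment = cheerio.load(node.data);
            const table = $comment("#batting_value").first();
    
            if (table.length) {
              const data = [...table.find("tr")].map(tr =>
                [...$comment(tr).find("td, th")].map(cell => $comment(cell).text().trim())
              );
              // trim the table a bit for display
              console.table(data.slice(0, 4).map(row => row.slice(0, 4)));
            }
          });
        });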
      });
    

    Output:

    ┌─────────┬────────┬───────┬───────┬──────┐
    │ (index) │   0    │   1   │   2   │  3   │
    ├─────────┼────────┼───────┼───────┼──────┤
    │    0    │ 'Year' │ 'Age' │ 'Tm'  │ 'Lg' │
    │    1    │ '2001' │ '21'  │ 'STL' │ 'NL' │
    │    2    │ '2002' │ '22'  │ 'STL' │ 'NL' │
    │    3    │ '2003' │ '23'  │ 'STL' │ 'NL' │
    └─────────┴────────┴───────┴───────┴──────┘
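
    If you'd rather work with objects than with arrays of cells, a small follow-up step can key each row by the header row. This is only a sketch: rowsToRecords is a hypothetical helper, and it assumes the first row holds the column names and that every row has the same number of cells, which may not hold for every row of the real table.

    ```javascript
    // Hypothetical helper: treat the first row as column names and turn each
    // remaining row into an object keyed by those names.
    function rowsToRecords(rows) {
      const [header, ...body] = rows;
      return body.map(row =>
        Object.fromEntries(header.map((name, i) => [name, row[i]]))
      );
    }

    // The four rows printed above, standing in for the full `data` array.
    const rows = [
      ["Year", "Age", "Tm", "Lg"],
      ["2001", "21", "STL", "NL"],
      ["2002", "22", "STL", "NL"],
      ["2003", "23", "STL", "NL"],
    ];

    const records = rowsToRecords(rows);
    console.log(records[0]); // { Year: '2001', Age: '21', Tm: 'STL', Lg: 'NL' }
    ```

    A row with fewer cells than the header ends up with undefined for the missing columns, so it may be worth filtering such rows out before converting.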