node.js web-scraping cheerio

Cheerio: Why can't I access elements correctly?


The HTML:

<body style="overflow: hidden">
    <div class="cookie-box"></div>
    <div id="next">
        <div></div>
        <div>
            <main id="main">
                <article>
                    <div>
                       <h2>title</h2>
                    </div>
                    <div></div>
                    <div class="layout">
                        <form></form>
                        <section aria-labelledby="Train">
                            <ul>
                                <li>
                                    <p>title</p>
                                    ...
                                </li>
                                <li></li>
                                ...
                            </ul>
                        </section>
                    </div>
                </article>
            </main>
        </div>
    </div>
</body>

I'm trying to iterate through the list (li elements):

$('#__next div #main article .layout .section[aria-labelledby=Train] > ul > li').map((_, item)=> {
 const $item = $(item);
 //here accessing the elements inside
})

However, I'm not even getting inside the map callback.

Also, when I try to access just one element:

 $('#__next div #main article .layout div:nth-of-type(1) h2').text()

I get a huge list of all h2 elements inside the content, even though I don't iterate and only access a specific h2 element.

What do I need to do differently? Thanks!!


Solution

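  • Several parts of your selector don't match anything in the HTML you've shown, like #__next and .section (these should be #next and section, respectively). A single non-matching step makes the whole selector match nothing, which is why the map callback never runs.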

    But there's no clear need to specify the hierarchy as strictly as you're doing. Choose the minimum to reliably disambiguate:

    import * as cheerio from "cheerio"; // ^1.0.0-rc.12 (namespace import also works on 1.0.0, which dropped the default export)
    
    const html = `
    <main id="main">
      <article>
        <div>
          <h2>h2 title</h2>
        </div>
        <div class="layout">
          <section aria-labelledby="Train">
            <ul>
              <li>
                <p>para 1</p>
              </li>
              <li>
                <p>para 2</p>
              </li>
            </ul>
          </section>
        </div>
      </article>
    </main>`;
    
    const $ = cheerio.load(html);
    const data = [...$('[aria-labelledby="Train"] li')].map(e => ({
      p: $(e).find("p").text(),
      // other selectors within the <li>
    }));
    const title = $("#main h2").first().text();
    console.log(data); // => [ { p: 'para 1' }, { p: 'para 2' } ]
    console.log(title); // => h2 title
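
    To see which step of the original selector fails, one option is to grow the selector a piece at a time and check .length after each step. This is only a debugging sketch: the HTML is a trimmed copy of the markup in the question, and the steps array and counts variable are illustrative names, not part of Cheerio.

    ```javascript
    import * as cheerio from "cheerio";

    // Trimmed copy of the HTML from the question.
    const html = `
    <div id="next">
      <div>
        <main id="main">
          <article>
            <div class="layout">
              <section aria-labelledby="Train">
                <ul><li><p>title</p></li><li></li></ul>
              </section>
            </div>
          </article>
        </main>
      </div>
    </div>`;

    const $ = cheerio.load(html);

    // Grow the selector step by step and record how many elements each version matches.
    const steps = [
      "#__next",                // id that isn't in the markup
      "#next",                  // corrected id
      "#next .layout .section", // class selector, but section is a tag name
      "#next .layout section",  // corrected tag selector
      "#next .layout section[aria-labelledby=Train] > ul > li",
    ];
    const counts = steps.map(selector => [selector, $(selector).length]);

    console.log(counts);
    // match counts, in order: 0, 1, 0, 1, 2
    ```

    The first step that reports 0 is the one to fix. Everything chained after it will also match nothing.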
    

    Choosing fewer selectors means you're less liable to encounter false negatives (for example, the selector breaking because one superfluous div in the chain disappeared), at the risk of false positives (for example, selecting something you don't intend to select, because insufficient disambiguation was used).
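
    The trade-off can be sketched with made-up markup. The two page versions, the nav list, and the selector names below are hypothetical and only serve to show one false negative and one false positive.

    ```javascript
    import * as cheerio from "cheerio";

    // Hypothetical markup: the same page before and after one wrapper <div> is dropped.
    const list = `<section aria-labelledby="Train"><ul><li>a</li><li>b</li></ul></section>`;
    const nav = `<nav><ul><li>home</li></ul></nav>`;
    const pages = {
      before: `<div id="next"><div>${nav}<main id="main">${list}</main></div></div>`,
      after: `<div id="next">${nav}<main id="main">${list}</main></div>`,
    };

    const selectors = {
      strict: '#next > div > #main > section[aria-labelledby="Train"] > ul > li',
      targeted: '[aria-labelledby="Train"] li',
      loose: "li",
    };

    // Count the matches of every selector on every version of the page.
    const counts = {};
    for (const [page, html] of Object.entries(pages)) {
      const $ = cheerio.load(html);
      counts[page] = {};
      for (const [name, selector] of Object.entries(selectors)) {
        counts[page][name] = $(selector).length;
      }
    }

    console.log(counts);
    // before: strict 2, targeted 2, loose 3
    // after:  strict 0, targeted 2, loose 3
    ```

    The strict chain stops matching once the wrapper disappears (a false negative), the bare li selector also picks up the nav item (a false positive), and the attribute-based selector survives the change without over-matching.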

    If you wind up with text joined together, it's because .text() concatenates the text of every element the selector matched, so you need to increase specificity. Maybe you want to loop over each item and call .text() on it. Maybe you need to make the selector more precisely targeted to the data you want. If you just want one element, the "easiest" way to do this is with .first(), .last(), .eq(1) (zero-indexed, so this is the second match), :nth-of-type(1), etc., but these can be unreliable and are suboptimal relative to choosing a precise selector based on non-positional ids, roles or classes.
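
    A minimal sketch of the joined-text symptom and two ways around it, assuming a page with more than one h2 under #main. The markup and titles here are invented for illustration, since the question doesn't show where the extra h2 elements live.

    ```javascript
    import * as cheerio from "cheerio";

    // Hypothetical markup with two <h2> elements under #main.
    const html = `
    <main id="main">
      <article>
        <div><h2>First title</h2></div>
        <div><h2>Second title</h2></div>
      </article>
    </main>`;

    const $ = cheerio.load(html);

    // .text() on a multi-element selection concatenates the text of every match.
    const joined = $("#main h2").text();

    // Mapping over the matches keeps one string per element.
    const titles = [...$("#main h2")].map(el => $(el).text());

    // .first() narrows the selection to one element before reading its text.
    const firstTitle = $("#main h2").first().text();

    console.log(joined);     // => First titleSecond title
    console.log(titles);     // => [ 'First title', 'Second title' ]
    console.log(firstTitle); // => First title
    ```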

    You might want to share more context if this isn't enough to get you moving again, because it's not really clear where the other <h2> you're inadvertently pulling in actually lives in the HTML structure, or what data you want to scrape from <li>s exactly, and what other structures on the page might be conflicting with the one you've shown.