Extracting Multiple Child Elements from a Parent using Cheerio


I'm trying to use Cheerio to scrape data and ultimately convert the resultant HTML to Markdown.

While not core to this question, to convert to Markdown, all I need is some valid HTML. Specifically, for this case, a div with one or more <ul> tags.

I mention this so it's clear that I'm not using the resultant HTML to directly render, but I need it in a form that I can use to convert to Markdown.

Using the simplified example below and given a known class name of "things", there are two <ul> tags in the parent div.

Note that the ul tags do not have a class or id in the code I'm scraping.

<div class="things"> // <= want
    <h5 class="heading">Things</h5> // <= don't want
    <ul> // <= want with children
        <li class="sub-heading">Fruits</li>
        <li class="fruit-item">Apple</li>
        <li class="fruit-item">Pear</li>
    </ul>
    <ul> // <= want with children
        <li class="sub-heading">Veg</li>
        <li class="veg-item">Carrot</li>
        <li class="veg-item">Spinach</li>
    </ul>
</div>

I want every ul, together with its list items, inside a surrounding div.

The following returns HTML without a surrounding div and with content I don't want (e.g. <h5 class="heading">Things</h5>):

const stuffIWant = $(".things").html();

The following returns HTML without a surrounding div, and only the contents of one of the <ul> tags, not the ul itself:

const stuffIWant = $(".things ul").html();

I know that this is because .html() returns the inner HTML of the first matched element, so I'm just getting the list items from the first ul.

This is my problem, and it's where I'm confusing myself.

I've also tried various forms of filter, map, and each, but I can't, for the life of me, get multiple <ul> tags returned in an enclosing div.

I'm thinking maybe I need to iterate through the "things" div, using each or map, and append the elements I want to a new div (somehow?), but that seems more complicated than it should be, so I'm asking here.

Any advice toward helping me wrap my head around this would be much appreciated.

Thanks.


Solution

  • While the question wasn't fully clarified, there seem to be two ways to interpret it. One possibility is that you want the text of every <li>, grouped into one array per <ul>:

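    const cheerio = require("cheerio");

    // `html` holds the markup from the question
    const $ = cheerio.load(html);
    // One inner array of <li> text per <ul>
    const result = [...$(".things ul")].map(ul =>
      [...$(ul).find("li")].map(li => $(li).text())
    );
    console.log(result);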
    

    Which gives

    [
      [ 'Fruits', 'Apple', 'Pear' ],
      [ 'Veg', 'Carrot', 'Spinach' ],
    ]
    

    Now, if the <div class="things"> wrapper is repeated and you want to distinguish each of these groups, you can modify the above code as follows:

    const cheerio = require("cheerio"); // 1.0.0-rc.12
    
    const html = `
    <div class="things">
      <h5 class="heading">Things</h5>
      <ul>
        <li class="sub-heading">Fruits</li>
        <li class="fruit-item">Apple</li>
        <li class="fruit-item">Pear</li>
      </ul>
      <ul>
        <li class="sub-heading">Veg</li>
        <li class="veg-item">Carrot</li>
        <li class="veg-item">Spinach</li>
      </ul>
    </div>
    <div class="things">
      <h5 class="heading">Things 2</h5>
      <ul>
        <li class="sub-heading">Foo</li>
        <li class="fruit-item">Bar</li>
        <li class="fruit-item">Baz</li>
      </ul>
    </div>
    `;
    
    const $ = cheerio.load(html);
    // Outer array: one entry per .things; middle: per <ul>; inner: <li> text
    const result = [...$(".things")].map(things =>
      [...$(things).find("ul")].map(ul =>
        [...$(ul).find("li")].map(li => $(li).text())
      )
    );
    console.log(JSON.stringify(result, null, 2));
    

    This gives:

    [
      [
        [
          "Fruits",
          "Apple",
          "Pear"
        ],
        [
          "Veg",
          "Carrot",
          "Spinach"
        ]
      ],
      [
        [
          "Foo",
          "Bar",
          "Baz"
        ]
      ]
    ]
    

    In other words, there's an extra layer:

    - .things
      - ul
        - li
    

    as opposed to the first snippet, which flattens the .things level:

    - .things ul
      - li