
Flatten nested array of titles and multiple children in Cheerio


Using Cheerio, I need to iterate over multiple levels of nesting to access elements. How can I use nested iteration and still return a flat array of objects? Because of the nested loop, my code currently returns an array of arrays of objects.

the html:

<body style="overflow: hidden">
    <div class="cookie-box"></div>
    <div id="next">
        <div></div>
        <div>
            <main id="main">
                <article>
                    <div>
                       <h2>title</h2>
                    </div>
                    <div></div>
                    <div class="layout">
                        <form></form>
                        <section aria-labelledby="Train">
                          <ul>
                            <li>
                               <p>title</p>
                               <span>
                                 <a>
                                    <span>123</span>
                                 </a>
                                 <a>
                                   <span>456</span>
                                 </a>
                                 ...
                               </span>
                              ...
                           </li>
                           <li></li>
                           ...
                      </ul>
                   </section>
                    </div>
                </article>
            </main>
        </div>
    </div>
</body>

the code:

const data = [...$('[aria-labelledby="Train"] li')].map(e => {
  // For each <li>, build one object per <a> inside its <span>
  return [...$(e).find('span a')].map(elem => {
    return {
      p: $(e).find("p").text(),
      number: $(elem).find('span').text()
    };
  });
});

what I would want as output:

[
  {
    p: "title",
    num: 123
  },
  {
    p: "title",
    num: 456
  }
]

what I get:

[
 [
  {
    p: "title",
    num: 123
  },
  {
    p: "title",
    num: 456
  }
 ]
]

Solution

  • You can use flatMap to remove that layer of nesting:

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const html = `
    <section aria-labelledby="Train">
      <ul>
        <li>
          <p>title</p>
          <span>
            <a>
              <span>123</span>
            </a>
            <a>
              <span>456</span>
            </a>
          </span>
        </li>
      </ul>
    </section>`;
    
    const $ = cheerio.load(html);
    const data = [...$('[aria-labelledby="Train"] li')].flatMap(li =>
      [...$(li).find("span a")].map(a => ({
        p: $(li).find("> p").text().trim(),
        number: $(a).text().trim(),
      }))
    );
    console.log(data);
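
    To see why flatMap removes exactly one layer, here is a minimal sketch on plain arrays, without Cheerio. The items array and the toRows helper are hypothetical stand-ins for the parsed list items and the inner map callback; the second entry is invented purely for illustration.

    ```javascript
    // Hypothetical plain-data stand-in for the parsed <li> elements.
    const items = [
      { title: "title", numbers: ["123", "456"] },
      { title: "other", numbers: ["789"] }, // invented second item
    ];

    // Hypothetical helper: one object per number, all sharing the item's title.
    const toRows = item =>
      item.numbers.map(number => ({ p: item.title, number }));

    const nested = items.map(toRows);          // array of arrays, one per item
    const flat = items.flatMap(toRows);        // inner arrays merged one level
    const alsoFlat = items.map(toRows).flat(); // same result in two steps

    console.log(nested.length, flat.length); // 2 3
    ```

    Since flatMap only flattens a single level, it fits this case, where each list item yields one array of objects.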
    

    Another approach that removes the nested loop is to iterate the inner elements and use .closest() to pop back up to the list item container to find the title:

    const data = [...$('[aria-labelledby="Train"] li span a')].map(
      a => ({
        p: $(a).closest("li").find("> p").text().trim(),
        number: $(a).text().trim(),
      })
    );
    

    This violates one of my rules of thumb in web scraping, "work top down, not bottom up", but both seem pretty acceptable in this case. The "bottom up" approach can get messy as soon as you need more elements from the container, or in cases when the parent element is liable to change unexpectedly. If you find you're reaching upward to different elements multiple times, consider going back to a top down approach.

    See "Scrape Table With Merge Header" and "cheeriojs select tags that are not inside another specified tag" for more complex examples of flattening a nested array in Cheerio.