javascript, web-scraping, cheerio

HTML Scraping with Cheerio


I'm having a hard time figuring out how to iterate over the children of paragraph elements using Cheerio.

<html>
<head>
</head>
<body>
<p>Hello, this is me - Daniel</p>
<p><strong>Hello</strong>, this is me - Daniel</p>
<p>Hello, <strong>this is me</strong> - Daniel</p>
<p>Hello, this is me - <strong>Norbert</strong></p>
<p><strong>Hello</strong>, this is me - <strong>Daniel</strong></p>
</body>
</html>

With find('*') or children('*'), Cheerio returns only the <strong> elements, not the plain-text nodes.

What I need is a list of all nested elements (beneath <p>) including the plain text.


Solution

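  • Try contents(). Unlike children(), which returns only element nodes, contents() returns every child node, including text nodes: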

    const cheerio = require("cheerio"); // 1.0.0-rc.12
    
    const html = `
    <p>Hello, this is me - Daniel</p>
    <p><strong>Hello</strong>, this is me - Daniel</p>
    <p>Hello, <strong>this is me</strong> - Daniel</p>
    <p>Hello, this is me - <strong>Norbert</strong></p>
    <p><strong>Hello</strong>, this is me - <strong>Daniel</strong></p>
    `;
    const $ = cheerio.load(html);
    const text = [...$("p").contents()].map(e => $(e).text());
    console.log(text);
    

    Output:

    [
      'Hello, this is me - Daniel',
      'Hello',
      ', this is me - Daniel',
      'Hello, ',
      'this is me',
      ' - Daniel',
      'Hello, this is me - ',
      'Norbert',
      'Hello',
      ', this is me - ',
      'Daniel'
    ]