javascript web-scraping puppeteer cheerio

Only keep the elements with text in them and remove all other elements


I am trying to scrape a website using Puppeteer and Cheerio. I used Puppeteer to get the HTML of the page I want to scrape, then loaded that HTML into Cheerio.

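const puppeteer = require("puppeteer");
const cheerio = require("cheerio");

async function run() {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // my own helper (not shown): navigates and returns the page's HTML
  const html = await get_to_page_with_required_source_code(page);
  const $ = cheerio.load(html);
  await browser.close();
}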

run();

What I want to do now is remove all elements from the HTML that contain no text. Below is an example.

<div class="abc">
    <img src="..." />
</div>
<div class="def">
    <div class="jkl">
        <span class="ghi">This is a text</span>
    </div>
    <div class="mno">This is another text</div>
</div>

The output of the above HTML should be:

<span class="ghi">This is a text</span>
<div class="mno">This is another text</div>

since these are the only two elements that contain text in them.

How can I accomplish this?


Solution

  • For starters, it's generally best not to combine Puppeteer and Cheerio. If the site is dynamic, use Puppeteer and work directly with the live DOM (use jQuery if you like Cheerio syntax). If the site is static, use fetch and Cheerio alone and skip the Puppeteer slowness.

    Here's one way to do it with Cheerio (you can toss in fetch to request the data if it's a static site):

    const cheerio = require("cheerio"); // ^1.0.0-rc.12
    
    const html = `
    <div class="abc">
        <img src="..." />
    </div>
    <div class="def">
        <div class="jkl">
            <span class="ghi">This is a text</span>
        </div>
        <div class="mno">This is another text</div>
    </div>`;
    
    const $ = cheerio.load(html);
    const textEls = [...$("*")]
      .filter(e => $(e).children().length === 0 && $(e).text().trim())
      .map(e => $.html($(e)));
    console.log(textEls);
    

    Here's how to do it with Puppeteer:

    const puppeteer = require("puppeteer"); // ^22.10.0
    
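    const html = `
    <div class="abc">
        <img src="..." />
    </div>
    <div class="def">
        <div class="jkl">
            <span class="ghi">This is a text</span>
        </div>
        <div class="mno">This is another text</div>
    </div>`;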
    
    let browser;
    (async () => {
      browser = await puppeteer.launch();
      const [page] = await browser.pages();
      await page.setContent(html, {waitUntil: "domcontentloaded"});
      const textEls = await page.$$eval("*", els =>
        els
          .filter(e => e.children.length === 0 && e.textContent.trim())
          .map(e => e.outerHTML)
      );
      console.log(textEls);
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close());
    

    The output is the same in both cases (add .join("\n") if you want a single string exactly as you posted it):

    [
      '<span class="ghi">This is a text</span>',
      '<div class="mno">This is another text</div>'
    ]
    

    Keep in mind: this is a bit of an odd thing to want to do, so there might be a better way to achieve whatever you're really trying to achieve.

    Disclosure: I'm the author of the linked blog post.