node.js, puppeteer, cheerio

Cheerio: how to remove DOM elements from a selection


I am trying to write a bot that converts a bunch of HTML pages to Markdown so that I can import them as Jekyll documents. For this, I use puppeteer to fetch each HTML document and cheerio to manipulate it.

The source HTML is pretty complex and polluted with Google Ads tags, external scripts, etc. What I need to do is get the HTML content of a predefined selector, then remove from it every element that matches a predefined set of selectors, so that I end up with plain HTML containing just the text, which I can then convert to Markdown.

Assume the source HTML is something like this:

<html>
<head />
<body>
<article class="post">
<h1>Title</h1>
<p>First paragraph.</p>
<script>That for some reason has been put here</script>
<p>Second paragraph.</p>
<ins>Google ADS</ins>
<p>Third paragraph.</p>
<div class="related">A block full of HTML and text</div>
<p>Fourth paragraph.</p>
</article>
</body>
</html>

What I want to achieve is something like:

<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
<p>Fourth paragraph.</p>

I defined an array of selectors that I want to strip from the source object:

const SELECTORS = {
  stripFromText: ['.social-share', 'script', '.adv-in', '.postinfo', '.postauthor', '.widget', '.related', 'img', 'p:empty', 'div:empty', 'section:empty', 'ins'],
};

And wrote the following function:

const getHTMLContent = async ($, selector) => {
  let value;
  try {
    let content = await $(selector);
    for (const s of SELECTORS.stripFromText) {
      // 1
      content = await content.remove(s);
      // 2
      // await content.remove(s);
      // 3
      // content = await content.find(s).remove();
      // 4
      // await content.find(s).remove();
      // 5
      // const matches = await content.find(s);
      // for (m of matches) {
      //  await m.remove();
      // }
    };
    value = content.html();
  } catch(e) {
    console.log(`- [!] Unable to get ${selector}`);
  }
  console.log(value);
  return value;
};

Where

  • $ is the cheerio object created with const $ = await cheerio.load(html);
  • selector is the DOM selector for the container (in the example above it would be .post)

What I am unable to do is use cheerio to remove() the unwanted elements. I tried all 5 numbered versions in the code above, without success. Cheerio's documentation hasn't helped so far, and I just found this link but the proposed solution did not work for me.

I was wondering if someone more experienced with cheerio could point me in the right direction or explain to me what I am missing here.


Solution

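  • I found a classic newbie error in my code: the removal has to be run on the descendants of the selection, i.e. content.find(s).remove(). Calling content.remove(s) directly only filters the selected elements themselves (here the .post container), so nothing inside the container gets removed. As far as I can tell, cheerio's methods are synchronous, so the await keywords are not actually needed.

    The working function now looks like this:

    // async is kept only so that existing callers that await the result keep working.
    const getHTMLContent = async ($, selector) => {
      let value;
      try {
        // Cheerio calls are synchronous, so no await is needed here.
        const content = $(selector);
        for (const s of SELECTORS.stripFromText) {
          console.log(`--- Stripping ${s}`);
          // Remove the descendants of the selection that match s.
          content.find(s).remove();
        }
        value = content.html();
      } catch (e) {
        console.log(`- [!] Unable to get ${selector}`);
      }
      return value;
    };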