javascript node.js xml cheerio cdata

Get image from figure tag using cheerio on xml


I am trying to extract the img src from the following XML tag inside an item.

I am calling cheerio.load on my response data like so:

const $ = cheerio.load(response.data, { xmlMode: true });
$("item").each((i, item) => {
  // ... the queries below run here, once per item
});

Inside each item I come across this specific tag, from which I want to extract the img src:

<figure class="wp-block-image size-large">
<img decoding="async" loading="lazy" width="800" height="572" src="http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-800x572.jpeg" alt="" class="wp-image-43535" srcset="http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-800x572.jpeg 800w, http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-350x250.jpeg 350w, http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2-768x549.jpeg 768w, http://wmcmuaythai.org/wp-content/uploads/2023/04/WhatsApp-Image-2023-04-07-at-3.18.13-PM-2.jpeg 1024w" sizes="(max-width: 800px) 100vw, 800px" />
</figure>

I have tried the following cheerio queries, and each one returns either undefined or something other than the src I want.

$(item).find("figure").find("img").attr("src")
$(item).find("img").attr("src")
$(item).find("figure").children().find("img").attr("src")
$(item).find("figure").first().find("img").attr("src")

This is the RSS feed from which I am trying to extract the figure:

http://wmcmuaythai.org/feed/


Solution

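  • I'm not too familiar with XML, but the tags you want appear to be inside a CDATA section. In XML mode, Cheerio treats CDATA content as plain text rather than parsing it as markup, so selectors such as find("img") never see the figure or img elements. I've had success in the past by loading the CDATA text into a second Cheerio instance, then traversing that inner structure.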

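    const cheerio = require("cheerio"); // ^1.0.0-rc.12

    fetch("<Your URL>")
      .then(res => {
        if (!res.ok) {
          throw new Error(res.statusText);
        }

        return res.text();
      })
      .then(xml => {
        // Parse the feed as XML; CDATA bodies come back as plain text.
        const $ = cheerio.load(xml, {xml: true});
        const result = [...$("content\\:encoded")].flatMap(e => {
          // Re-parse each CDATA body as HTML so its <img> tags become elements.
          const $inner = cheerio.load($(e).text());
          return [...$inner("img")].map(img => $inner(img).attr("src"));
        });
        console.log(result);
        console.log(result.length); // => 51 (for the feed in the question, at the time of writing)
      })
      .catch(err => console.error(err));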
    

    Depending on your expected result, you may want to use map instead of flatMap so that the image URLs stay grouped by the item they came from.