node.js, jsdom

How to get image details within an anchor element <a><img src></a>


I am using nodejs and jsdom, attempting to retrieve images used within anchor tags.

I can enumerate the images using .querySelectorAll("img"); and the anchors using .querySelectorAll("a");.

But I can't find the relationship between the two, which is the part I am after: I want to know that clicking a given image navigates to x.

Sample HTML:

<a href="http://www.yahoo.com">
  <img src="https://s.yimg.com/nq/nr/img/yahoo_mail_global_english_white_1x.png" alt="Yahoo Mail Image">
</a>

Node.js

var links = dom.window.document.querySelectorAll("a");
links.forEach(function (value) {
  console.log('Host: ' + value.hostname);
  console.log('Href: ' + value.href);
  console.log('Text: ' + value.text);
  console.log('HTML: ');
  console.dir(value);
});

Expected result:

The link to x is displayed with image.alt "Yahoo Mail Image" and image.src "https://...."


Solution

  • Without seeing your HTML context, I can suggest running queries within the link subtrees:

    const {JSDOM} = require("jsdom"); // ^22.0.0
    
    const html = `
    <a href="http://www.yahoo.com">
      <img src="https://s.yimg.com/nq/nr/img/yahoo_mail_global_english_white_1x.png" alt="Yahoo Mail Image">
    </a>
    <a href="http://www.google.com">
      <img src="google.png" alt="Google Image">
    </a>
    <a href="http://www.example.com">
      <img src="whatever.png" alt="Whatever Image">
    </a>`;
    
    const {window: {document}} = new JSDOM(html);
    const data = [...document.querySelectorAll("a")].map(e => ({
      src: e.querySelector("img").src,
      alt: e.querySelector("img").getAttribute("alt"),
      href: e.href,
    }));
    console.log(data);
    

    Output:

    [
      {
        src: 'https://s.yimg.com/nq/nr/img/yahoo_mail_global_english_white_1x.png',
        alt: 'Yahoo Mail Image',
        href: 'http://www.yahoo.com/'
      },
      {
        src: 'google.png',
        alt: 'Google Image',
        href: 'http://www.google.com/'
      },
      {
        src: 'whatever.png',
        alt: 'Whatever Image',
        href: 'http://www.example.com/'
      }
    ]
    

    However, the page you're working with likely contains other links, so the bare a selector is probably too broad: it will also pick up links that have no <img> child, in which case e.querySelector("img") returns null and reading .src throws. I would add a parent container to the selector to narrow it down.

    The Sizzle-style pseudo-selector a:has(img), XPath, or a filter (shown below) might also help:

    const data = [...document.querySelectorAll("a")]
      .filter(e => e.querySelector(":scope > img"))
      .map(e => ({
        src: e.querySelector("img").src,
        alt: e.querySelector("img").getAttribute("alt"),
        href: e.href,
      }));
    

    ...but this is speculation.