Finding position of dom node in the document source


Context

I'm building a set of 'extractor' functions whose purpose is to extract what looks like components from a page (using jsdom and nodejs). The final result should be these 'component' objects ordered by where they originally appeared in the page.

Problem

The last part of this process is a bit problematic. As far as I can see, there's no easy way to tell where a given element is in a given dom document's source code.

The numeric depth or css/xpath-like path also doesn't feel helpful in this case.

Example

With the given extractors...

const extractors = [

  // Extract buttons
  dom => 
    Array.from(dom.window.document.querySelectorAll('button'))
    .map(elem => ({
      type: 'button',
      name: elem.name,
      position: undefined, /* this part needs to be computed from elem */
    })),

  // Extract links
  dom => 
    Array.from(dom.window.document.querySelectorAll('a'))
    .map(elem => ({
      type: 'link',
      name: elem.textContent,
      position: undefined, /* this part needs to be computed from elem */
      link: elem.href,
    })),

];

...and the given document (I know, it's an ugly and un-semantic example..):

<html>
  <body>
    <a href="/">Home</a>
    <button>Login</button>
    <a href="/about">About</a>
...

I need something like:

[
  { type: 'button', name: 'Login', position: 45, ... },
  { type: 'link', name: 'Home', position: 20, ... },
  { type: 'link', name: 'About', position: 72, ... },
]

(which can be later ordered by item.position)

For example, 45 is the position/offset of <button in the example HTML string.
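
Once every extractor fills in position, merging and ordering the results is the easy part. A minimal sketch, assuming each extractor returns an array of objects with a numeric position; sortComponents is a hypothetical helper name, and the two sample arrays are hard-coded stand-ins for real extractor output:

```javascript
// Hypothetical helper: merge the per-extractor result arrays and order the
// components by their source offset (ascending).
function sortComponents(resultsPerExtractor) {
  return resultsPerExtractor
    .flat()                                   // one flat list of components
    .sort((a, b) => a.position - b.position); // earliest offset first
}

// Hard-coded stand-ins for what the two extractors above would return.
const buttons = [{ type: 'button', name: 'Login', position: 45 }];
const links = [
  { type: 'link', name: 'Home', position: 20 },
  { type: 'link', name: 'About', position: 72 },
];

const ordered = sortComponents([buttons, links]);
console.log(ordered.map(c => c.name)); // [ 'Home', 'Login', 'About' ]
```

With the real extractors, the call would presumably look like sortComponents(extractors.map(extract => extract(dom))).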


Solution

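  • You could just iterate over all the elements in the DOM and assign each one an index, provided your DOM doesn't change afterwards:

    // Symbol() is called without `new`; `new Symbol()` throws a TypeError
    const pos = Symbol('document position');
    // querySelectorAll('*') returns the elements in document order
    for (const [index, element] of dom.window.document.querySelectorAll('*').entries()) {
        element[pos] = index;
    }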
    

    Then your extractor can just use that:

    dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
      type: 'link',
      name: elem.textContent,
      position: elem[pos],
      link: elem.href,
    })),
    

  • Alternatively, jsdom can attach the source position in the parsed HTML text to every node; see the includeNodeLocations option. If you create the JSDOM instance with { includeNodeLocations: true }, dom.nodeLocation(elem) returns the location of elem, and its startOffset is the character offset into the source, which is in document order as well. Your extractor can then use:

    dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
      type: 'link',
      name: elem.textContent,
      position: dom.nodeLocation(elem).startOffset,
      link: elem.href,
    })),