Context
I'm building a set of 'extractor' functions whose purpose is to extract what looks like components from a page (using jsdom and nodejs). The final result should be these 'component' objects ordered by where they originally appeared in the page.
Problem
The last part of this process is a bit problematic. As far as I can see, there's no easy way to tell where a given element is in a given dom document's source code.
The numeric depth or css/xpath-like path also doesn't feel helpful in this case.
Example
With the given extractors...
const extractors = [
  // Extract buttons
  dom =>
    Array.from(dom.window.document.querySelectorAll('button'))
      .map(elem => ({
        type: 'button',
        name: elem.name,
        position: undefined, // this part needs to be computed from elem
      })),
  // Extract links
  dom =>
    Array.from(dom.window.document.querySelectorAll('a'))
      .map(elem => ({
        type: 'link',
        name: elem.textContent,
        position: undefined, // this part needs to be computed from elem
        link: elem.href,
      })),
];
...and the given document (I know, it's an ugly and un-semantic example..):
<html>
<body>
<a href="/">Home</a>
<button>Login</button>
<a href="/about">About</a>
...
I need something like:
[
{ type: 'button', name: 'Login', position: 45, ... },
{ type: 'link', name: 'Home', position: 20, ... },
{ type: 'link', name: 'About', position: 72, ... },
]
(which can later be ordered by item.position)
For example, 45 is the position/offset of the <button within the example HTML string.
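You could just iterate over all the elements in the DOM and assign each one an index, provided your DOM doesn't change afterwards:
// Symbol is called as a plain function, not with `new`
const pos = Symbol('document position');
// querySelectorAll('*') returns the elements in document order
for (const [index, element] of dom.window.document.querySelectorAll('*').entries()) {
  element[pos] = index;
}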
Then your extractor can just use that:
dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: elem[pos],
  link: elem.href,
})),
Alternatively, JSDOM provides a feature that records, for every node, its source position in the parsed HTML text; see the includeNodeLocations option. The startOffset values will be in document order as well. So if you parse the input with that option enabled, you can use:
dom => Array.from(dom.window.document.querySelectorAll('a'), elem => ({
  type: 'link',
  name: elem.textContent,
  position: dom.nodeLocation(elem).startOffset,
  link: elem.href,
})),
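With either approach, the remaining step is to run all extractors and sort the combined results by position. Here is a minimal sketch of that step; the helper name extractInDocumentOrder and the stand-in extractors with hard-coded positions are made up for illustration and are not part of jsdom:

```javascript
// Hypothetical helper: run every extractor against the same dom and
// return all components ordered by their numeric `position`.
function extractInDocumentOrder(dom, extractors) {
  return extractors
    .flatMap(extract => extract(dom))          // one flat list of components
    .sort((a, b) => a.position - b.position);  // ascending document order
}

// Stand-in extractors (assumption: real ones would read from `dom`).
// Positions are copied from the example in the question.
const fakeExtractors = [
  () => [{ type: 'button', name: 'Login', position: 45 }],
  () => [
    { type: 'link', name: 'Home', position: 20 },
    { type: 'link', name: 'About', position: 72 },
  ],
];

const ordered = extractInDocumentOrder(null, fakeExtractors);
console.log(ordered.map(c => c.name)); // [ 'Home', 'Login', 'About' ]
```

In real use you would pass the actual dom instead of null. For the startOffset variant, the dom would have to be created with new JSDOM(html, { includeNodeLocations: true }), otherwise dom.nodeLocation is not available.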