I'm trying to scrape a website that doesn't use classes or ids, and the structure is like this:
<div>
    <div>
        <div>
            Some content
        </div>
    </div>
    <div>
        Other content
    </div>
</div>
I'm trying something like doc.css('div div')
but that's returning duplicates of the content, since nested containers all match that selector.
How do I select only the bottom of the nest, knowing that they are not all the same depth?
Another way to phrase the question: is there a way to select a "div with no div children"? It may have other children, just not divs.
Edit:
To clarify: with the above HTML I can call:
doc.css('div div').map(&:text)
To get the text of the document, divided into an array by the divs. The problem is, that line is returning "Some content" twice, because even though it exists once in the html, there are two 'div div' matches with that text.
This code finds all the leaf elements (elements with no child elements) and then keeps only the ones that are divs. I'm assuming that's what you're trying to do.
// will be used to store all the leaves
const leaves = [];
// recursively collects every element that has no child elements
const findLeaves = ($branch) => {
    if ($branch.children.length === 0) {
        leaves.push($branch);
        return;
    }
    [...$branch.children].forEach(($child) => findLeaves($child));
};
// parent element of the elements you want to search through
const $branch = document.querySelector("body > div");
// start the search
findLeaves($branch);
// keep only the leaves that are divs
const what_you_want = leaves.filter(($leaf) => $leaf.tagName === "DIV");
console.log(what_you_want);
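One caveat: the question says the target divs may have other children, just not divs, and a div that contains, say, a span is not a leaf, so the code above would skip it. Below is a minimal sketch of a variant that keeps every div with no div anywhere beneath it. I'm assuming "no div children" really means "no div descendants". The names findInnermostDivs, hasDivDescendant, el and sampleRoot are made up for illustration, and the sample tree is a plain-object stand-in for the question's markup so the snippet can also run outside a browser.

```javascript
// Sketch: collect every <div> that has no <div> descendants.
// Works on any node exposing `tagName` and `children` (DOM elements do).
const isDiv = (node) => node.tagName === "DIV";

// True if any element below `node` is a div.
const hasDivDescendant = (node) =>
    [...node.children].some((child) => isDiv(child) || hasDivDescendant(child));

// Depth-first walk that keeps the innermost divs, in document order.
const findInnermostDivs = (node, found = []) => {
    if (isDiv(node) && !hasDivDescendant(node)) {
        found.push(node);
        return found; // nothing below this node can be a div
    }
    [...node.children].forEach((child) => findInnermostDivs(child, found));
    return found;
};

// Hypothetical stand-in for the question's markup (plain objects, not real DOM).
// The extra SPAN shows that non-div children do not disqualify a div.
const el = (tagName, text, children = []) => ({ tagName, text, children });
const sampleRoot = el("DIV", "", [
    el("DIV", "", [el("DIV", "Some content", [el("SPAN", "")])]),
    el("DIV", "Other content"),
]);

const innermostTexts = findInnermostDivs(sampleRoot).map((div) => div.text);
console.log(innermostTexts);
```

In a page you would pass a real element instead, for example findInnermostDivs(document.body), and read textContent from the results. In browsers that support the :has() pseudo-class, document.querySelectorAll("div:not(:has(div))") should select the same elements without any recursion.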