css nokogiri

Select only the bottom of nested divs, without knowing how nested they are


I'm trying to scrape a website that doesn't use classes or ids, and the structure is like this:

<div>
  <div>
    <div>
      Some content
    </div>
  </div>
  <div>
    Other content
  </div>
</div>

I'm trying something like doc.css('div div'), but that returns the same content more than once, since every nested container matches that selector.

How do I select only the bottom of the nest, knowing that they are not all the same depth?

Another way to phrase the question: is there a way to select something like "div with no div children"? It may have other children, just not divs.

Edit:

To clarify: with the above HTML I can call:

doc.css('div div').map(&:text)

This gives me the text of the document, split into an array by div. The problem is that this line returns "Some content" twice: even though it appears only once in the HTML, there are two 'div div' matches containing that text.


Solution

  • This code (browser JavaScript) finds all the leaf elements, meaning elements with no element children, and keeps the ones that are divs. I'm assuming that is what you're trying to do.

    // collects every leaf element found during the traversal
    const leaves = [];

    // recursively walks the tree; an element with no element children is a leaf
    const findLeaves = ($branch) => {
        if ($branch.children.length === 0) {
            leaves.push($branch);
            return;
        }
        [...$branch.children].forEach(($child) => findLeaves($child));
    };

    // parent element of the elements you want to search through
    const $root = document.querySelector("body > div");

    // start the search from the root
    findLeaves($root);

    // keep only the leaves that are divs
    const what_you_want = leaves.filter(($leaf) => $leaf.tagName === "DIV");
    console.log(what_you_want);
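
The leaf-based approach above skips a div whose only children are non-div elements (a div wrapping a p, say), which the question explicitly allows. Here is a minimal sketch of a "div with no div anywhere inside it" variant. The names findInnermostDivs, hasDivDescendant, el and sample are hypothetical, and the sample tree is built from plain objects standing in for the question's markup, so treat it as an illustration rather than a drop-in solution.

```javascript
// Hypothetical helpers, not part of the original answer. They work on any
// element-like object exposing `tagName` and an iterable `children`,
// which includes real DOM elements in a browser.
const isDiv = (node) => node.tagName === "DIV";

// true if any descendant of `node` (at any depth) is a div
const hasDivDescendant = (node) =>
    [...node.children].some((child) => isDiv(child) || hasDivDescendant(child));

// collects every div that has no div beneath it
const findInnermostDivs = (root, found = []) => {
    if (isDiv(root) && !hasDivDescendant(root)) {
        found.push(root);
        return found; // nothing below this point can be a div
    }
    [...root.children].forEach((child) => findInnermostDivs(child, found));
    return found;
};

// Stand-in for the question's markup, built from plain objects so the sketch
// also runs outside a browser. The second div holds a <p> to show that
// non-div children are allowed.
const el = (tagName, text = "", children = []) => ({ tagName, text, children });
const sample = el("DIV", "", [
    el("DIV", "", [el("DIV", "Some content")]),
    el("DIV", "Other content", [el("P", "a paragraph")]),
]);

const texts = findInnermostDivs(sample).map((div) => div.text);
console.log(texts);
```

With real DOM elements you would read div.textContent instead of the made-up text property. In Nokogiri the same condition can likely be written as an XPath query, for example doc.xpath('//div[not(.//div)]'), though I have not tested that against the page in question.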