
Get all tags following a certain tag with Mechanize (Ruby)?


How can I get all the elements that follow a given one, as in:

<div id="exemple">
  <h2 class="target">foo</h2>
  <p>bla bla</p>
  <ul>
    <li>bar1</li>
    <li>bar2</li>
    <li>bar3</li>
  </ul>
  <h4>baz</h4> 
  <ul>
     <li>lot</li>
  </ul>
  <div>of</div>
  <p>possible</p>
  <p>tags</p>
  <a href="#">after</a>
</div>

I need to find <h2 class="target"> and collect every tag up to the next <h4>, ignoring that <h4> and all the tags that follow it. (If no <h4> exists, I have to collect every tag up to the end of the parent; here, the end of the <div>.)

The content is dynamic and unpredictable. The only rule is that there is a target and there is an <h4> (or the end of the parent element). I need to get all the tags between the two and exclude all the others.

With this example, I need to get the following HTML:

<h2 class="target">foo</h2>
<p>bla bla</p>
<ul>
  <li>bar1</li>
  <li>bar2</li>
  <li>bar3</li>
</ul>

So far I can get the target with target = page.at('#exemple .target'). I know the next_sibling method, but how can I test the tag type of the current node?

I am thinking of something like this to walk the node tree:

html = ''
while not target.is_a? 'h4'   # pseudocode: this is the test I don't know how to write
  html << target.inner_html
  target = target.next_sibling
end

How can I do this?


Solution

  • You can subtract the ones you don't want from your nodeset:

    h2 = page.at('h2')
    (h2.search('~ *') - h2.search('~ h4','~ h4 ~ *')).each do |el|
        # el is not an h4 and does not follow an h4
    end
    

    It might make more sense to use XPath, but this is something I can do without googling.

    Your idea of iterating over the next siblings can work too:

    el = page.at('h2 ~ *')
    while el && el.name != 'h4'
        # do something with el
        el = el.at('+ *')
    end