ruby, html-parsing, nokogiri

Nokogiri: only get list items that start with a link


I have a document that looks like the following:

<ul>
  <li>
    <a href="/Synergies">Link</a>Content
  </li>
  <li>
    Content <a href="/Synergies">Link</a>
  </li>
</ul>

I would like to only obtain the list items that start with an <a> tag, i.e. the first <li> would be a hit but the second would not.

I tried getting all list items and regex-matching on their HTML content, but it doesn't appear to work:

list.search('li').each do |item|
  if /^<a href="\/Synergies".*$/.match(item) 
    puts link # hit?
  end
end

Any advice would be appreciated!


Solution

  • You can check whether the item's first child is either not a text node or a whitespace-only text node:

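    list.search('li').each do |item|
      first = item.children.first
      next if first.nil? # skip list items with no children at all
      # Hit when the item opens with a non-text node or with whitespace-only text
      if !first.text? || first.text.strip.empty?
        puts item # hit
      end
    end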
    

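    If you want to exclude items that don't begin with a link, you can select each <a> that is the first element child of its <li> and check its parent in the condition. The check is still needed because a:first-child ignores text nodes, so it also matches a link that is preceded by text: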

    list.search('li > a:first-child').each do |item|
      # item is the <a>; item.parent is the enclosing <li>
      first = item.parent.children.first
      if !first.text? || first.text.strip.empty?
        puts item # hit (prints the link; use item.parent for the whole <li>)
      end
    end