Tags: ruby, dom, xpath, html-parsing, nokogiri

How to find text across HTML tag boundaries?


I have HTML like this:

<div>Lorem ipsum <b>dolor sit</b> amet.</div>

How can I find a plain-text match for the search string "ipsum dolor" in this HTML? I need the XPath of the start node and of the end node of the match, plus character indexes that point inside those start and end nodes. I use Nokogiri to work with the DOM, but any Ruby solution is fine.

Difficulties:

  • I can't walk the DOM with node.traverse {|node| … } and run a plain-text search on each text node I come across, because my search string can cross tag boundaries.

  • I can't run a plain-text search after converting the HTML to plain text, because I need XPath positions as the result.
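
To make the first difficulty concrete, here is a minimal sketch in plain Ruby. The three strings are the text nodes of the example written out by hand (an assumption for illustration; they are not read from the DOM here):

```ruby
# Text nodes of <div>Lorem ipsum <b>dolor sit</b> amet.</div>,
# written out by hand for illustration.
text_nodes = ['Lorem ipsum ', 'dolor sit', ' amet.']
needle = 'ipsum dolor'

# Searching each text node on its own misses the match ...
per_node_hit = text_nodes.any? { |t| t.include?(needle) }

# ... while the concatenated text contains it, but without node positions.
joined_hit = text_nodes.join.include?(needle)

puts per_node_hit  # false
puts joined_hit    # true
```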

I could implement it myself with basic tree traversal, but before I do, I'd like to know whether there is a Nokogiri function or trick that makes this easier.


Solution

  • In the end, we used the following code. It is shown for the example given in the question, but it also works in the general case of HTML tags nested to arbitrary depth, which is what we need.

    In addition, we implemented it so that a run of excess (≥2) whitespace characters is treated like a single one. That is why we have to search for the end of the match rather than adding the length of the search string (the quote) to the start position: the number of whitespace characters in the quote and in the matched text might differ.

    require 'nokogiri'
    
    doc = Nokogiri::HTML.fragment("<div>Lorem ipsum <b>dolor sit</b> amet.</div>")
    quote = 'ipsum dolor'
    
    
    # (1) Find search string in document text, "plain text in plain text".
    
    quote_query = 
      quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
    
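    match = doc.text.match(/#{quote_query}/i)
    raise ArgumentError, "quote not found: #{quote.inspect}" unless match
    
    start_index = match.begin(0)  # index of the first matched character
    end_index   = match.end(0)    # index just past the last matched character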
    
    
    # (2) Find XPath values and character indexes for our search match.
    # 
    # To do this, walk through all text nodes and count characters until 
    # encountering both the start_index and end_index character counts 
    # of our search match.
    
    start_xpath, start_offset, end_xpath, end_offset = nil
    i = 0
    
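    # XPath of the element that contains the given text node. Paths inside a
    # fragment start with "?", which we strip along with the "/text()[n]" step.
    def element_xpath(text_node)
      text_node.path.sub(/^\?/, '').sub(%r{/text\(\).*$}, '')
    end
    
    # Number of characters in all preceding siblings, so that offsets count
    # from the start of the parent element's text, not of the text node.
    def preceding_text_size(text_node)
      sum = 0
      e = text_node.previous
      while e
        sum += e.text.size
        e = e.previous
      end
      sum
    end
    
    doc.xpath('.//text() | text()').each do |x|
      x.text.each_char.with_index do |_, offset|
        if i == start_index
          start_xpath = element_xpath(x)
          start_offset = offset + preceding_text_size(x)
        end
        # Separate "if" rather than "elsif": for a one-character match the
        # start and the end fall on the same character.
        if i + 1 == end_index
          end_xpath = element_xpath(x)
          end_offset = offset + 1 + preceding_text_size(x)
        end
        i += 1
      end
    end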
    

    At this point, we have the XPath values for the start and the end of the search match, plus character offsets that point to the exact character inside each XPath-designated element. The offsets count characters of the element's whole text (including the text of any preceding child elements), and the end offset points just past the last matched character. For the example, we get:

    puts start_xpath   # /div
    puts start_offset  # 6
    puts end_xpath     # /div/b
    puts end_offset    # 5
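
The whitespace handling described above can be tried in isolation. The following is a minimal sketch in plain Ruby, without Nokogiri. The helper name whitespace_tolerant_pattern and the sample text with extra whitespace are assumptions made for illustration, not part of the code we used. It shows that the matched text can be longer than the quote, so the end index has to be read from the match itself.

```ruby
# Hypothetical helper: any run of whitespace in the quote matches any run
# of whitespace in the text; everything else is matched literally.
def whitespace_tolerant_pattern(quote)
  words = quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }
  Regexp.new(words.join('[[:space:]]+'), Regexp::IGNORECASE)
end

# Sample text (assumed): four whitespace characters between the two words.
text  = "Lorem ipsum \n  dolor sit amet."
quote = 'ipsum dolor'

match = whitespace_tolerant_pattern(quote).match(text)
raise ArgumentError, "quote not found: #{quote.inspect}" unless match

start_index = match.begin(0)  # first matched character
end_index   = match.end(0)    # just past the last matched character

puts start_index              # 6
puts end_index                # 20
puts end_index - start_index  # 14, length of the matched text
puts quote.size               # 11, so start_index + quote.size would be wrong
```

Reading both positions from the same match also avoids running the regular expression a second time just to measure the matched text.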