ruby, xpath, nokogiri, benchmarking

Why is finding a node with the desired text faster with Ruby than with XPath?


Recently I had to check whether HTML nodes contain a desired text. I was surprised that when I refactored the code to use XPath selectors, it became 10x slower. Here is a simplified version of the original code, together with the benchmark:

# has_keyword_benchmark.rb
require 'benchmark'
require 'nokogiri'

Doc = Nokogiri("
<div>
  <div>
    A
  </div>
  <p>
    <b>A</b>
  </p>
  <span>
    B
  </span>
</div>")

def has_keywords_with_xpath
  Doc.xpath('./*[contains(., "A")]').size > 0
end

def has_keywords_with_ruby
  Doc.text.include? 'A'
end

iterations = 10_000
Benchmark.bm(27) do |bm|
  bm.report('checking if has keywords with xpath') do
    iterations.times do
      has_keywords_with_xpath
    end
  end

  bm.report('checking if has keywords with ruby') do
    iterations.times do
      has_keywords_with_ruby
    end
  end
end

When I run ruby has_keyword_benchmark.rb, I get:

                                  user     system      total        real
checking if has keywords with xpath  0.400000   0.020000   0.420000 (  0.428484)
checking if has keywords with ruby  0.020000   0.000000   0.020000 (  0.023773)

Intuitively, checking whether a node contains some text should be faster with XPath, but it is not. Does anybody have an idea why?


Solution

  • Typically, parsing and compiling an XPath expression takes much longer than actually executing it, even on quite a large document. For example, with Saxon, running the expression count(//*[contains(., 'e')]) against a 1 MB source document, compiling the path expression takes 200ms, while executing it takes around 18ms.

    If your XPath API allows you to compile an XPath expression once and then execute it repeatedly (or if it caches the compiled expression behind the scenes) then it's definitely worth taking advantage of that capability.

    The actual XPath execution is likely to be at least as fast as your hand-written navigation code, possibly rather faster. It's the preparation that causes the overhead.
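
The compile-once idea above can be sketched in plain Ruby. This is only an illustration of the caching pattern, not Nokogiri's or Saxon's API: the ExpressionCache class, its compile step, and the use of a keyword regexp as the "expression" are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```ruby
# Hypothetical sketch: ExpressionCache and its compile step are illustrative
# stand-ins, not part of Nokogiri or any real XPath library.
class ExpressionCache
  attr_reader :compile_count

  def initialize
    @compiled = {}
    @compile_count = 0
  end

  # Returns a callable for the expression, compiling it only on first use.
  def fetch(expression)
    @compiled[expression] ||= compile(expression)
  end

  private

  # Stand-in for the expensive parse/compile phase. Here the "expression" is
  # just a keyword, and the compiled form is a lambda that tests a string.
  def compile(keyword)
    @compile_count += 1
    pattern = Regexp.new(Regexp.escape(keyword))
    ->(text) { text.match?(pattern) }
  end
end

cache = ExpressionCache.new
texts = ['A', 'B', 'xAx']

# The same expression is requested three times but compiled only once.
results = texts.map { |text| cache.fetch('A').call(text) }
```

With a cache like this, the preparation cost is paid on the first lookup and every later call only pays for execution. Whether a given XPath library exposes or performs such caching has to be checked in its own documentation.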