Recently I had to check whether HTML nodes contain a desired text. I was surprised that when I refactored the code to use XPath selectors it became 10x slower. Here is a simplified version of the original code with the benchmark:
# has_keyword_benchmark.rb
require 'benchmark'
require 'nokogiri'

Doc = Nokogiri("
<div>
  <div>
    A
  </div>
  <p>
    <b>A</b>
  </p>
  <span>
    B
  </span>
</div>")

def has_keywords_with_xpath
  Doc.xpath('./*[contains(., "A")]').size > 0
end

def has_keywords_with_ruby
  Doc.text.include? 'A'
end

iterations = 10_000
Benchmark.bm(27) do |bm|
  bm.report('checking if has keywords with xpath') do
    iterations.times do
      has_keywords_with_xpath
    end
  end
  bm.report('checking if has keywords with ruby') do
    iterations.times do
      has_keywords_with_ruby
    end
  end
end
When I run ruby has_keyword_benchmark.rb, I get:
                                          user     system      total        real
checking if has keywords with xpath  0.400000   0.020000   0.420000 (  0.428484)
checking if has keywords with ruby   0.020000   0.000000   0.020000 (  0.023773)
Intuitively, checking whether a node contains some text should be faster with XPath, but it is not. Does anybody have an idea why?
Typically, parsing and compiling an XPath expression takes much longer than actually executing it, even on quite a large document. For example, with Saxon, when running the expression count(//*[contains(., 'e')]) against a 1 MB source document, compiling the path expression takes 200 ms, while executing it takes around 18 ms.
If your XPath API allows you to compile an XPath expression once and then execute it repeatedly (or if it caches the compiled expression behind the scenes), then it's definitely worth taking advantage of that capability.
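As far as I know, Nokogiri (via libxml2) re-parses the expression string on every #xpath call and does not expose a precompiled-expression object, so I can't show the pattern with Nokogiri itself. The same compile-once principle is easy to see by analogy with Ruby's Regexp, though; a minimal sketch:

require 'benchmark'

TEXT = ('lorem ipsum dolor sit amet ' * 100).freeze

Benchmark.bm(16) do |bm|
  bm.report('compile per call') do
    # pays the pattern-compilation cost on every iteration
    10_000.times { TEXT =~ Regexp.new('dolor\s+sit') }
  end
  bm.report('compile once') do
    pattern = Regexp.new('dolor\s+sit') # hoisted out of the hot loop
    10_000.times { TEXT =~ pattern }
  end
end

The exact numbers don't matter; the point is that hoisting the compilation out of the loop is the same amortization a compiled XPath expression gives you.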
The actual XPath execution is likely to be at least as fast as your hand-written navigation code, possibly rather faster. It's the preparation that causes the overhead.
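For what it's worth, in Nokogiri the existence check itself can be pushed into XPath: #xpath returns typed results for non-node-set expressions, so a boolean(...) query comes back as true or false. A sketch (has_keywords? is my name, not the asker's; the expression string is still parsed on every call, so this tidies the code rather than removing the overhead):

require 'nokogiri'

doc = Nokogiri('<div><div>A</div><p><b>A</b></p><span>B</span></div>')

# boolean(...) makes libxml2 hand back true/false directly instead of a
# node set that Ruby then has to size-check.
def has_keywords?(doc)
  doc.xpath('boolean(./*[contains(., "A")])')
end

puts has_keywords?(doc) # => true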