Is there a better way to extract the visible text on a web page using Nokogiri? Currently I use the inner_text
method, however that method counts a lot of JavaScript as visible text. The only text I want to capture is the visible text on the screen.
For example, in IRB if I do the following in Ruby 1.9.2-p290:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)
If I search for the word "function" I see that it appears 20 times in the list, however if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX the word "function" does not appear anywhere in the visible text.
Can I ignore JavaScript or is there a better way of doing this?
You could try:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.traverse{ |x|
if x.text? && x.text !~ /^\s*$/
puts x.text
end
}
I have not done much with Nokogiri, but I believe this should find/output all text nodes in the document that are not blanks. This at least seems to be ignoring the javascript and all the text I checked was visible on the page (though some of it in the dropdown menus).