Search code examples
rubynokogiri

Nokogiri grab only visible inner_text


Is there a better way to extract the visible text on a web page using Nokogiri? Currently I use the inner_text method, however that method counts a lot of JavaScript as visible text. The only text I want to capture is the visible text on the screen.

For example, in IRB if I do the following in Ruby 1.9.2-p290:

require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)

If I search for the word "function" I see that it appears 20 times in the list, however if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX the word "function" does not appear anywhere in the visible text.

Can I ignore JavaScript or is there a better way of doing this?


Solution

  • You could try:

    require 'nokogiri'
    require 'open-uri'
    
    doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
    
    doc.traverse{ |x|
        if x.text? && x.text !~ /^\s*$/
            puts x.text
        end
    }
    

    I have not done much with Nokogiri, but I believe this should find/output all text nodes in the document that are not blanks. This at least seems to be ignoring the javascript and all the text I checked was visible on the page (though some of it in the dropdown menus).