ruby nokogiri open-uri

How to get Nokogiri inner_html to ignore/remove escape sequences


I am trying to get the inner HTML of an element on a page using Nokogiri. However, I'm not just getting the text of the element, I'm also getting the whitespace escape sequences (\r, \n, \t) around it. Is there a way I can suppress or remove them with Nokogiri?

require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+, open-uri no longer patches Kernel#open, so call URI.open explicitly
page = Nokogiri::HTML(URI.open("http://the.page.url.com"))

page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html

This returns => "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"

What is the most effective and direct Nokogiri (or plain Ruby) way of removing them?


Solution

  • page.at_css("td[custom-attribute='foo']")
        .parent
        .css('td')
        .css('a')
        .text               # you need the text content, not inner_html
        .strip              # removes leading and trailing whitespace from the result
    

    See String#strip, which removes leading and trailing whitespace (including \r, \n and \t).

    Sidenote: css('td a') is likely more efficient than css('td').css('a').
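
To see what String#strip does to the exact string from the question, here is a minimal plain-Ruby sketch that needs no network access or Nokogiri. The constant RAW and the helpers trim_edges and collapse_whitespace are hypothetical names introduced for illustration only; the second helper is an optional extra, assuming you might also want to flatten whitespace that appears inside the text, which strip alone leaves untouched.

```ruby
# The raw string from the question, as returned by inner_html.
RAW = "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"

# Hypothetical helper: removes leading and trailing whitespace only.
def trim_edges(raw)
  raw.strip
end

# Hypothetical helper: collapses every run of whitespace (including
# \r, \n and \t inside the text) into one space, then trims the edges.
def collapse_whitespace(raw)
  raw.gsub(/\s+/, ' ').strip
end

p trim_edges(RAW)                                       # => "TheActuallyInnerContentThatIWant"
p collapse_whitespace("\r\n\tFoo\r\n\t\tBar\r\n")       # => "Foo Bar"
```

For the link text in the question, strip is enough, because the unwanted characters only appear at the start and end of the string.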