
Using the Ruby Mechanize "links_with" to grab text but getting extra content


When I grab a group of links using the Mechanize links_with method, I only want the text of each link, but I'm getting a series of extra characters:

links = @some_page.links_with(text: /V\s.*(BENCH|EARCX)|(BENCH|EARCX).*V/)

links.each do |link|
  link.text
end

The links are shown in my browser as "23409BENCH092834" and "20193BENCH092339", which is exactly what I want. However, when I save them in my database they get saved as

\r\n\t\t\t\t\r\n\t\t\t\t\t 23409BENCH092834\r\n\t\t\t\t\r\n\t\t\t\t

Where did these extra characters come from and what do they represent? I've tried using text and to_s on them but it isn't getting rid of these random characters.


I think they may be escape codes but if so how would I remove them?


Solution

  • You didn't give us example HTML showing the markup you're working against, which makes it very difficult to help you. Help us help you.

    Mechanize uses Nokogiri internally and can return a Nokogiri document (the page's parser method), so you'll want to get that. From there you're in Nokogiri's domain, which gives you more control over the searching.

    Mechanize's links_with finds all matching links in the document and returns them as an array of Mechanize::Page::Link objects, each wrapping the Nokogiri node for its a tag. Those nodes probably contain other nodes, along with the whitespace used to indent the markup, and that is where the tabs and returns you're seeing come from. While links_with is useful, you always have to be aware of what a method returns so you can react to it correctly.

    The problem you're seeing occurs either because you're not accessing the right tag when you extract the text, or because the values you say you see in the links aren't exactly what you report.
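As for what the characters are: \r, \n and \t are the escaped forms of carriage return, newline and tab, which is ordinary whitespace from the page's indentation that a browser collapses when it renders the link. A minimal sketch of removing them, assuming the saved value is exactly the string shown in the question (the variable names are only for illustration):

```ruby
# The value from the question, written as a double-quoted string so the
# escapes become real carriage returns, newlines, and tabs.
raw = "\r\n\t\t\t\t\r\n\t\t\t\t\t 23409BENCH092834\r\n\t\t\t\t\r\n\t\t\t\t"

# String#strip removes leading and trailing whitespace.
clean = raw.strip

# If runs of whitespace could also appear inside the text, collapse each
# run to a single space before stripping.
collapsed = raw.gsub(/\s+/, ' ').strip

puts clean      # prints 23409BENCH092834
puts collapsed  # prints 23409BENCH092834
```

With Mechanize the same call should apply to each link, for example link.text.strip. Stripping only treats the symptom, though; the Nokogiri examples that follow show where the whitespace comes from.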

    Consider this:

    require 'nokogiri'
    
    doc = Nokogiri::HTML(<<EOT)
    <html>
    <body>
    <p>foo</p>
    |
    <p>bar</p>
    </body>
    </html>
    EOT
    

    Extracting text from a tag higher up (a parent) than the one you actually want returns all the text inside that parent:

    doc.search('body').text # => "\nfoo\n|\nbar\n"
    

    Notice that it picked up the line-breaks and the | that sit between the tags. That's because text returns the content of all text nodes, not just those inside a child tag. So being explicit about what you want is important.

    Similarly, searching for only the p tags returns all the text found inside them:

    doc.search('p').text # => "foobar"
    

    This usually isn't what you want either, because text concatenates the text of every node in the NodeSet returned by search, so you can no longer tell where one value ends and the next begins.

    Instead, find the specific node you want and get its text:

    doc.at('p').text # => "foo"
    

    at returns the first matching node and is equivalent to search('p').first.

    If you want all the text from the p nodes, then iterate over them:

    doc.search('p').map(&:text) # => ["foo", "bar"]
    

    In more complex documents we often have to find a specific landmark in the hierarchy of tags and navigate to it, then search further, but that's a separate issue.

    Putting all that together, here's a sample that helps visualize what you're encountering and how to deal with it:

    require 'nokogiri'
    
    doc = Nokogiri::HTML(<<EOT)
    <html>
    <body>
      <a href="http://example.com">
        <span class="hubbub">foo</span>
      </a>
      |
      <a href="http://example.com">
        <span class="hubbub">bar</span>
      </a>
    </body>
    </html>
    EOT
    

    Don't do these:

    doc.search('body').text # => "\n  \n    foo\n  \n  |\n  \n    bar\n  \n"
    doc.search('a').text # => "\n    foo\n  \n    bar\n  "
    

    Do these:

    doc.search('a span').map(&:text) # => ["foo", "bar"]
    

    Or:

    spans = doc.search('a').map{ |link|
      link.at('span').text
    }
    spans # => ["foo", "bar"]
    

    The first is faster because it relies on the libxml2 code to find the span nodes matching the 'a span' CSS selector. The second is slower but more flexible, since it lets you use Ruby to iterate over the links and peek into their tags.

    See "How to avoid joining all text from Nodes when scraping" also.