
Issue with gsub in my Ruby code when trying to replace HTML <a> tags with the URL they contain


I am trying to do a basic substitution, but I can't work out why it behaves the way it does.

I want to replace each <a> tag with the URL in its href attribute.

This is my code:

require 'nokogiri'

message = "Hi Testin wFAASF,
Thank you for booking with us.
Your work has been booked on Sep 16, 2020 1:00PM at 2026 South Clark Street / unit c / Chicago, Illinois 60616
Sincerely,
Varun Security
<a href=\"https://www.google.com\">Test This PR</a>"

puts message.gsub(Nokogiri::HTML.parse(message).at('a'), Nokogiri::HTML.parse(message).at('a')['href'])

What I expect the output to be:

"Hi Testin wFAASF,
Thank you for booking with us.
Your work has been booked on Sep 16, 2020 1:00PM at 2026 South Clark Street / unit c / Chicago, Illinois 60616
Sincerely,
Varun Security
https://www.google.com"

What the actual output is:

"Hi Testin wFAASF,
Thank you for booking with us.
Your work has been booked on Sep 16, 2020 1:00PM at 2026 South Clark Street / unit c / Chicago, Illinois 60616
Sincerely,
Varun Security
<a href=\"https://www.google.com\">https://www.google.com</a>"

Could someone explain why this is happening and how I could do this better?


Solution

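  • Because Nokogiri::XML::Element is neither a String nor a Regexp, gsub does not search for the tag's markup. Judging by your output, the element was implicitly converted to its text content ("Test This PR"), so only the link text was replaced. Calling .to_s on it, which returns the serialized tag, works: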

    # Parse once and reuse the node for both the pattern and the replacement.
    link = Nokogiri::HTML.parse(message).at('a')
    puts message.gsub(link.to_s, link['href'])
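
To see why the original call only swapped the link text, here is a minimal sketch of how String#gsub treats a pattern that is neither a String nor a Regexp: it asks the object for an implicit conversion via to_str. FakeNode is a hypothetical stand-in, not part of Nokogiri; it assumes (as the observed output suggests) that a node's implicit string form is its text content, while to_s gives the serialized tag.

```ruby
# Hypothetical stand-in for a Nokogiri element (NOT a Nokogiri class).
class FakeNode
  def initialize(text, markup)
    @text = text
    @markup = markup
  end

  # Implicit conversion: String#gsub calls this for non-Regexp patterns.
  def to_str
    @text
  end

  # Explicit conversion: the serialized tag.
  def to_s
    @markup
  end
end

message = 'Varun Security <a href="https://www.google.com">Test This PR</a>'
node = FakeNode.new('Test This PR',
                    '<a href="https://www.google.com">Test This PR</a>')

# Pattern becomes "Test This PR", so only the link text is replaced.
implicit = message.gsub(node, 'https://www.google.com')

# Pattern is the whole tag, so the tag itself is replaced.
explicit = message.gsub(node.to_s, 'https://www.google.com')

puts implicit
puts explicit
```

The first line printed still contains the <a> tag with the URL as its text, which mirrors the output in the question; the second has the tag replaced by the bare URL.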
    

    However, you are going to the trouble of parsing the HTML only to search the raw string again, as if you knew nothing about its structure. This also gives the wrong result if a message contains multiple links (only the first one is looked up), or if the anchor tag is not written exactly as Nokogiri serializes it, e.g. with an extra space: <a href="https://www.google.com" >https://www.google.com</a>. In that case the serialized tag no longer matches the original text, and nothing is replaced.

    Why not let Nokogiri do the work?

    puts Nokogiri::HTML.fragment(message).tap { |doc|
      doc.css("a").each { |node|
        node.replace(node["href"])
      }
    }.to_html
    

    Note that I switched to Nokogiri::HTML.fragment, since the message is not a full HTML document; with Nokogiri::HTML.parse, Nokogiri would feel obligated to add the doctype and the surrounding html and body elements. The block then replaces each anchor node with the value of its href attribute.