ruby-on-rails, regex, ruby, sanitize

Extract the mailto value and remove any HTML tags from a string


I want to extract the mailto value from a given string and remove the HTML tags at the same time.

Example input -> "<mailto:[email protected]|[email protected]> helo<p> bye </p>"
Expected output -> [email protected] helo bye

If I use this -> gsub(/<[^>]*>/,'')
output -> helo bye

If I use this -> ActionView::Base.full_sanitizer.sanitize(html_string, :tags => %w(img br p), :attributes => %w(src style))
output -> helo bye

Can you suggest how I can get my expected output?
expected output -> [email protected] helo bye


Solution

  • The problem is that the mailto value sits inside an HTML-style tag, so when you remove the tags, you remove the mailto value as well. It is definitely possible to construct a single complex regular expression that would handle this, but I think it's much easier to extract the mailto value separately from the rest of the string. I would do this with a capturing group that extracts the value between "mailto:" and "|". Then you can get the rest of the output by processing the full string with the gsub call you already have.

    s = "<mailto:[email protected]|[email protected]> helo<p> bye </p>"
    
    # Find the "mailto" value
    s.match(/mailto:([^|]*)/)
    => #<MatchData "mailto:[email protected]" 1:"[email protected]">
    
    # Full result with the matched email and the rest of the string with HTML tags removed
    s.match(/mailto:([^|]*)/)[1] + s.gsub(/<[^>]*>/, "")
    => "[email protected] helo bye "
    

    If the string starts with something other than the <mailto> tag, you could replace the whole tag with just the matched email address and then get rid of the other tags after that:

    s = "this is <mailto:[email protected]|[email protected]> helo<p> bye </p>"
    
    # Replace mailto tag with the email, then process the rest
    # '\1' is a backreference to the first capture group
    s.gsub(/<mailto:([^|]*)[^>]*>/, '\1').gsub(/<[^>]*>/, "")
    => "this is [email protected] helo bye "
    
    # Alternatively, you can just process the mailto tag differently in the gsub block
    s.gsub(/<[^>]*>/) do |tag|
      tag.include?("mailto:") ? tag.match(/mailto:([^|]*)/)[1] : ""
    end
    => "this is [email protected] helo bye "
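
The block approach above can be wrapped in a small helper. This is only a minimal sketch under a few assumptions: the method name `extract_mailto_and_strip_tags` is hypothetical, `user@example.com` is a placeholder address (not the one from the question), and the trailing `squeeze`/`strip` calls are an addition that trims the leftover whitespace so the result matches the expected output exactly. It uses `String#[]` with a capture index, which returns `nil` instead of raising when a tag has no "mailto:" part.

```ruby
# Hypothetical helper: the name and the whitespace cleanup are illustrative
# additions, not part of the original answer.
def extract_mailto_and_strip_tags(text)
  text
    # For every <...> tag: keep the address after "mailto:" if there is one,
    # otherwise replace the tag with an empty string (nil.to_s => "").
    .gsub(/<[^>]*>/) { |tag| tag[/mailto:([^|>]*)/, 1].to_s }
    .squeeze(" ") # collapse runs of spaces left behind by removed tags
    .strip        # drop leading/trailing whitespace
end

# Placeholder input; the real addresses in the question are not known.
puts extract_mailto_and_strip_tags(
  "this is <mailto:user@example.com|user@example.com> helo<p> bye </p>"
)
```

Because the capture excludes both "|" and ">", a tag without a label such as `<mailto:user@example.com>` should be handled as well, and a string with no mailto tag simply has its tags removed.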