Tags: ruby-on-rails, image, sanitize

Sanitizing an HTML fragment in Rails to get only the img src so I can display it in an image tag


I am using Feedjira to parse some RSS feeds and I am getting data back that looks like:

<table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=us&amp;usg=AFQjCNEnMLee_eB0lY7hCtIqJCf8Iy2StQ&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52778768548994&amp;ei=xaUHVaj4GcLBmQLyjIDIDw&amp;url=http://www.foxnews.com/weather/2015/03/15/cyclone-pam-vanuatu/?intcmp%3Dlatestnews"><img src="//t1.gstatic.com/images?q=tbn:ANd9GcTHyV7D2Zf-QfzLZ-7qJlk0mE3nU7qM3-mnENtJPURJTk8o9Kh-Iqc_focHCHAALYhnRuY1Nop6" alt="" border="1" width="80" height="80"><br><font size="-2">Fox News</font></a></font></td><td valign="top" class="j"><font style="font-size:85%;font-family:arial,sans-serif"><br><div style="padding-top:0.8em;"><img alt="" height="1" width="1"></div><div class="lh"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;ct2=us&amp;usg=AFQjCNHTAYRk1bcvBCJxvZ4M0OUUrXTXQg&amp;clid=c3a7d30bb8a4878e06b80cf16b898331&amp;cid=52778768548994&amp;ei=xaUHVaj4GcLBmQLyjIDIDw&amp;url=http://www.dailymail.co.uk/wires/reuters/article-2997951/Aid-agencies-begin-helicopter-flights-cyclone-stricken-Vanuatu.html"><b>Aid agencies begin...

I'm trying to strip everything except the img src so I can display it on my page in an img tag next to the article text. I'm using Ryan Grove's Sanitize gem as follows:

<img class="media-object" src="<%= Sanitize.fragment(entry.content, :elements => ['img'], :attributes => { 'img' => ['src']}) %>" alt="..." style="width:72px;height:72px">

However, this inserts the following into my HTML:

<img class="media-object" src="<img src="//t1.gstatic.com/images?q=tbn:ANd9GcTHyV7D2Zf-QfzLZ-7qJlk0mE3nU7qM3-mnENtJPURJTk8o9Kh-Iqc_focHCHAALYhnRuY1Nop6">Fox News <img>  Aid agencies begin flights to cyclone-stricken Vanuatu, official toll lowered Daily Mail TANNA, March 17 (Reuters) - International aid agencies began emergency flights on Tuesday to some of the remote outer islands of Vanuatu, which they fear have been devastated by a monster cyclone that tore through the South Pacific island nation. Relief, hardship as Cyclone Pam survivors battle onBangkok Post UN says 24 dead in Vanuatu after Cyclone Pam7Online WSVN-TV Fears for food supplies in Vanuatu as capital cleans upThe Star Online Xinhua -MSNBC -Bloomberg all 4,389 news articles » " alt="..." style="width:72px;height:72px">

Any ideas how I can just get that src link and not everything else?

Appreciate any assistance!


Solution

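  • For very simple HTML extraction like this, a regular expression is usually enough. Sanitize cleans the fragment (it keeps the allowed img tags along with the text of the elements it strips) rather than pulling out an attribute value, which is why the whole cleaned fragment ends up inside src. For example, with feedjira_output standing for the HTML string (entry.content in the question),

    feedjira_output =~ /src="([^"]+)"/

    This captures the URL of the first src attribute in a regex group (accessible via the $1 variable after a successful match).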
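
    The regex approach above could be wrapped in a small helper so the view does not depend on $1. This is only a sketch: first_image_src is a hypothetical name, the sample string is a shortened stand-in for the feed content, and the pattern assumes the src value is quoted.

```ruby
# Hypothetical helper (not part of Feedjira or Sanitize): returns the src of
# the first <img> tag that has a quoted src attribute, or nil if none is found.
def first_image_src(html)
  # String#[] with a regex and a capture index returns that capture group,
  # or nil when the pattern does not match. [^>]* keeps the match inside
  # a single tag, so an <img> without src is skipped.
  html.to_s[/<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']/i, 1]
end

# Shortened, made-up stand-in for entry.content.
sample = '<td><a href="http://example.com/story">' \
         '<img src="//t1.gstatic.com/images?q=tbn:abc" alt="" width="80"></a>' \
         '<img alt="" height="1" width="1"></td>'

puts first_image_src(sample)
```

    In the view, the call would then replace the Sanitize call, for example src="<%= first_image_src(entry.content) %>", assuming the method is defined in a Rails helper module. The helper returns nil when no image is present, so the template may want to skip the img tag in that case.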