I am using Feedjira to parse some RSS feeds and I am getting data back that looks like:
<table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNEnMLee_eB0lY7hCtIqJCf8Iy2StQ&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778768548994&ei=xaUHVaj4GcLBmQLyjIDIDw&url=http://www.foxnews.com/weather/2015/03/15/cyclone-pam-vanuatu/?intcmp%3Dlatestnews"><img src="//t1.gstatic.com/images?q=tbn:ANd9GcTHyV7D2Zf-QfzLZ-7qJlk0mE3nU7qM3-mnENtJPURJTk8o9Kh-Iqc_focHCHAALYhnRuY1Nop6" alt="" border="1" width="80" height="80"><br><font size="-2">Fox News</font></a></font></td><td valign="top" class="j"><font style="font-size:85%;font-family:arial,sans-serif"><br><div style="padding-top:0.8em;"><img alt="" height="1" width="1"></div><div class="lh"><a href="http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHTAYRk1bcvBCJxvZ4M0OUUrXTXQg&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52778768548994&ei=xaUHVaj4GcLBmQLyjIDIDw&url=http://www.dailymail.co.uk/wires/reuters/article-2997951/Aid-agencies-begin-helicopter-flights-cyclone-stricken-Vanuatu.html"><b>Aid agencies begin...
I'm trying to strip everything except the img src so I can display it on my page in an img tag next to the article text. I'm using Ryan Grove's Sanitize gem in the following way:
<img class="media-object" src="<%= Sanitize.fragment(entry.content, :elements => ['img'], :attributes => { 'img' => ['src']}) %>" alt="..." style="width:72px;height:72px">
However, this inserts the following into my HTML:
<img class="media-object" src="<img src="//t1.gstatic.com/images?q=tbn:ANd9GcTHyV7D2Zf-QfzLZ-7qJlk0mE3nU7qM3-mnENtJPURJTk8o9Kh-Iqc_focHCHAALYhnRuY1Nop6">Fox News <img> Aid agencies begin flights to cyclone-stricken Vanuatu, official toll lowered Daily Mail TANNA, March 17 (Reuters) - International aid agencies began emergency flights on Tuesday to some of the remote outer islands of Vanuatu, which they fear have been devastated by a monster cyclone that tore through the South Pacific island nation. Relief, hardship as Cyclone Pam survivors battle onBangkok Post UN says 24 dead in Vanuatu after Cyclone Pam7Online WSVN-TV Fears for food supplies in Vanuatu as capital cleans upThe Star Online Xinhua -MSNBC -Bloomberg all 4,389 news articles » " alt="..." style="width:72px;height:72px">
Any ideas how I can just get that src link and not everything else?
Appreciate any assistance!
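Sanitize.fragment returns the sanitized HTML (the allowed img tags plus all the remaining text), not the value of an attribute, which is why the whole string ends up inside your src. For pulling a single attribute out of markup like this, a regular expression is usually enough. For example,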
feedjira_output =~ /src="([^"]+)"/
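This puts the source URL in the first regex capture group, accessible via the $1 variable (which is nil if nothing matched). It matches the first src attribute in the string, which in this feed is the thumbnail image.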
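
If you'd rather not rely on the $1 global, here is a minimal sketch of the same idea wrapped in a helper. The name first_img_src is made up for illustration, and the sample string is a shortened stand-in for entry.content; it assumes the src value is quoted and that you only want the first img in the fragment.

```ruby
# Hypothetical helper (not part of Feedjira or Sanitize): returns the src of
# the first <img> tag in an HTML fragment, or nil if there is none.
def first_img_src(html)
  # String#[] with a regex and a group index returns just that capture group.
  # to_s guards against a nil entry.content.
  html.to_s[/<img\b[^>]*?\bsrc=["']([^"']+)["']/i, 1]
end

# Shortened, made-up stand-in for the feed content shown in the question.
sample = '<td><a href="http://example.com/a">' \
         '<img src="//t1.gstatic.com/images?q=tbn:abc" alt="" width="80">' \
         '<br>Fox News</a></td>'

src = first_img_src(sample)
puts src
```

In the view this would be used as src="<%= first_img_src(entry.content) %>". Requiring the <img prefix keeps the pattern from picking up a src on some other tag. If the feed markup gets more complicated, an HTML parser such as Nokogiri is the safer choice.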