
Trying to Parse Only the Images from an RSS Feed


First, I am a PHP newbie. I have looked at the question and solution here. For my needs, however, the parsing does not go deep enough into the individual articles.

A small sampling of my rss feed reads like this:

 <channel>
 <atom:link href="http://mywebsite.com/rss" rel="self" type="application/rss+xml" />
 <title>My Web Site</title>
 <description>My Feed</description>
 <link>http://mywebsite.com/</link>

 <image>
 <url>http://mywebsite.com/views/images/banner.jpg</url>
 <title>My Title</title>
 <link>http://mywebsite.com/</link>
 <description>Visit My Site</description>
 </image>

 <item>
 <title>Article One</title>
 <guid isPermaLink="true">http://mywebsite.com/details/e8c5106</guid>
 <link>http://mywebsite.com/geturl/e8c5106</link>
 <comments>http://mywebsite.com/details/e8c5106#comments</comments>     
 <pubDate>Wed, 09 Jan 2013 02:59:45 -0500</pubDate> 
 <category>Category 1</category>    
 <description>
      <![CDATA[<div>
      <img src="http://mywebsite.com/myimages/1521197-main.jpg" width="120" border="0"  />  
      <ul><li>Poster: someone's name;</li>
      <li>PostDate: Tue, 08 Jan 2013 21:49:35 -0500</li>
      <li>Rating: 5</li>
      <li>Summary:Lorem ipsum dolor </li></ul></div><div style="clear:both;">]]>
      </description>
 </item> 
 <item>..

The image links that I want to parse out are the ones nested deep inside each <item>'s <description>.

The code in my php file reads:

     <?php
 $xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');
 $imgs = $xml->xpath('/item/description/img');
 foreach($imgs as $image) {
      echo $image->src;
 }
 ?>

Can someone please help me figure out how to configure the php code above?

Also a very newbie question... once I get the resulting image URLs, how can I display the images in a row in my HTML page?

Many thanks!!!

Hernando


Solution

  • The <img> tags inside that RSS feed are not actually elements of the XML document, contrary to what the syntax highlighting on this site suggests. They are just text inside the <description> element that happens to contain the characters < and >. This is why an XPath expression looking for img elements returns nothing.

    The string <![CDATA[ tells the XML parser that everything from there until it encounters ]]> is to be treated as a raw string, regardless of what it contains. This is useful for embedding HTML inside XML, since the HTML tags wouldn't necessarily be valid XML. It is equivalent to escaping the whole HTML (e.g. with htmlspecialchars) so that the <img> tags would look like &lt;img&gt;. (I went into more technical detail in another answer.)
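To make the CDATA point concrete, here is a minimal sketch using two tiny made-up documents (the element content and the file name a.jpg are assumptions for illustration, not taken from the real feed). It suggests that SimpleXML sees the same plain text either way, and no child elements:

```php
<?php
// Two tiny hypothetical documents: one uses CDATA, the other uses entity escaping
$cdata   = simplexml_load_string('<description><![CDATA[<img src="a.jpg" />]]></description>');
$escaped = simplexml_load_string('<description>&lt;img src="a.jpg" /&gt;</description>');

// Both produce exactly the same text content
var_dump((string)$cdata === (string)$escaped);  // bool(true)

// Neither has child elements, so an XPath search for <img> finds nothing
var_dump(count($cdata->children()));            // int(0)
var_dump(count($cdata->xpath('//img')));        // int(0)
```

The string cast is what gives you the HTML text to feed into a second parsing step.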

    So extracting the images from the RSS feed requires two steps: first, get the text of each <description>, and second, find all the <img> tags in that text.

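    $xml = simplexml_load_file('http://mywebsite.com/rss?t=2040&dl=1&i=1&r=ceddfb43483437b1ed08ab8a72cbc3d5');

    // Collect parser warnings quietly; HTML embedded in feeds is often not well-formed
    libxml_use_internal_errors(true);

    // Step 1: get the text of every <description> inside an <item>
    $descriptions = $xml->xpath('//item/description');
    foreach ( $descriptions as $description_node ) {
        $html = trim( (string)$description_node );
        if ( $html === '' ) {
            continue; // loadHTML() rejects an empty string
        }

        // The description may not be valid XML, so use the more forgiving HTML parser
        $description_dom = new DOMDocument();
        $description_dom->loadHTML( $html );

        // Switch back to SimpleXML for readability
        $description_sxml = simplexml_import_dom( $description_dom );

        // Step 2: find all images, and extract their 'src' attribute
        $imgs = $description_sxml->xpath('//img');
        foreach ( $imgs as $image ) {
            // Print one URL per line
            echo (string)$image['src'], "\n";
        }
    }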
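
As for showing the images in a row: one possible approach is to collect the src values into an array instead of echoing them, then print one <img> tag per URL inside a single container. The sketch below assumes a hypothetical helper named render_image_row and uses an inline flexbox style for the row; the width of 120 simply mirrors the value in the feed. Neither the helper name nor the styling comes from the original answer.

```php
<?php
// Hypothetical helper: turn a list of image URLs into one row of <img> tags.
function render_image_row(array $urls): string
{
    $tags = [];
    foreach ($urls as $url) {
        // Escape the URL so characters such as & or " cannot break the attribute
        $tags[] = '<img src="' . htmlspecialchars($url, ENT_QUOTES) . '" width="120" alt="" />';
    }
    // Flexbox lays the images out side by side; wrapping keeps long rows on the page
    return '<div style="display: flex; flex-wrap: wrap; gap: 8px;">'
        . implode('', $tags)
        . '</div>';
}

// Example with the one image URL shown in the question's feed sample
$urls = ['http://mywebsite.com/myimages/1521197-main.jpg'];
echo render_image_row($urls);
```

In the loop above, replacing the echo with `$urls[] = (string)$image['src'];` (after initialising `$urls = [];` before the loop) would gather the URLs for this helper.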