Tags: php, simple-html-dom, simplepie

Using SimplePie and Simple HTML DOM together


I'm trying to use SimplePie to pull a list of links from an RSS feed and then scrape the linked pages with Simple HTML DOM to pull out images. I can get SimplePie to pull the links and store them in an array. I can also use the Simple HTML DOM parser to get the image link that I'm looking for. The problem is that when I use SimplePie and Simple HTML DOM in the same script, I get a 500 error. Here's the code:

set_time_limit(0);
error_reporting(0);

$rss = new SimplePie();
$rss->set_feed_url('http://contently.com/strategist/feed/');
$rss->init();

foreach($rss->get_items() as $item)
  $urls[] = $item->get_permalink();
unset($rss);

/*
$urls = array(
'https://contently.com/strategist/2016/01/22/whats-in-a-spotify-name-and-5-other-stories-you-should-read/',
'https://contently.com/strategist/2016/01/22/how-to-make-content-marketing-work-inside-a-financial-services-company/',
'https://contently.com/strategist/2016/01/22/glenn-greenwald-talks-buzzfeed-freelancing-the-future-journalism/',
...
'https://contently.com/strategist/2016/01/19/update-a-simpler-unified-workflow/');
*/ 

foreach($urls as $url) {
  $html = new simple_html_dom();
  $html->load_file($url);
  $images = $html->find('img[class=wp-post-image]',0);
  echo $images;
  $html->clear();
  unset($html);
}

I commented out the urls array, but it is identical to the array created by the SimplePie loop (I created it manually from the results). It fails on the find command the first time through the loop. If I comment out the $rss->init() line and use the static url array, all of the code runs with no errors, but of course that doesn't give me the result I want. Any help is greatly appreciated!


Solution

  • There's a strange incompatibility between simple_html_dom and SimplePie. After the HTML is loaded, simple_html_dom->root is not set, which causes an error in any subsequent operation.

    Curiously, switching from object mode to function mode works fine for me:

    $html = file_get_html( $url );
    

    instead of:

    $html = new simple_html_dom();
    $html->load_file($url);
    

    Anyway, simple_html_dom is known for causing problems, above all with memory usage.

    Edited:

    OK, I have found the bug. It resides in simple_html_dom->load_file(), which calls the standard function file_get_contents(), then checks the result through error_get_last() and, if an error is found, unsets its own data. But if an error occurred earlier (in my test, SimplePie output the warning ./cache is not writeable), simple_html_dom interprets that previous error as a file_get_contents() failure.

    If you have PHP 7 installed, you can call error_clear_last() after unset($rss), and your code should work. Otherwise, you can use my code above, or pre-load the HTML data into a variable and then call simple_html_dom->load() instead of simple_html_dom->load_file().
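
The failure mode described above can be reproduced without either library. The following is only a minimal sketch: load_with_stale_check() is a hypothetical stand-in for the kind of check described in the answer, not the library's actual source, and the trigger_error() call simulates an unrelated earlier warning such as the SimplePie cache warning. It assumes PHP 7 or later for error_clear_last().

```php
<?php
// Hypothetical stand-in for the flawed check: it reads the data and then
// treats ANY recorded error as a read failure, even one that predates the call.
function load_with_stale_check(string $path)
{
    $data = @file_get_contents($path);
    if (error_get_last() !== null) {
        return false; // may be blaming an older, unrelated error
    }
    return $data;
}

// Local sample file so the sketch does not depend on the network.
$tmp = tempnam(sys_get_temp_dir(), 'shd');
file_put_contents($tmp, '<img class="wp-post-image" src="a.png">');

// Simulate an unrelated warning raised earlier in the script.
// The @ operator hides it from output, but error_get_last() still records it.
@trigger_error('./cache is not writeable', E_USER_WARNING);

// The read itself succeeds, yet the stale warning makes the check fail.
$before = load_with_stale_check($tmp);

// Clearing the recorded error (PHP 7+) lets the same call succeed.
error_clear_last();
$after = load_with_stale_check($tmp);

unlink($tmp);

var_dump($before === false, is_string($after));
```

Note that error_reporting(0), as used in the question, only hides messages; the last error is still recorded, which is why the earlier warning can be picked up later.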