I am trying to scrape this url https://nrg91.gr/nrg-airplay-chart/ using simple-html-dom, but it does not seem to get the full html source code. This code:
include_once('simple_html_dom.php');
$html = file_get_html('https://nrg91.gr/nrg-airplay-chart');
echo $html->plaintext;
displays the content up to the h1, just before the content I am after. And from the simple-html-dom manual examples, this should display all links from that url:
foreach($html->find('a') as $e)
echo $e->href . '<br>';
but it only displays the links up to the main navigation menu, not from the main body or footer.
I also tried using prerender.com, to fully load url before passing it to file_get_html but the result was the same. What am I doing wrong?
Here's my super dirty approach to fetching the rank/artist/title/youtube data using both DOMDocument and SimpleXML.
The concept is to locate each "row" of data via the xpath //ul[@id="chart_ul"]/li
, then using dom_import_simplexml( $outer )->getNodePath()
to build a new xpath to select the individual elements where the desired data can be located.
$temp = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'nrg-airplay-chart.html';
if( file_exists( $temp ) === false or filemtime( $temp ) < time() - 3600 )
{
file_put_contents( $temp, $html = file_get_contents('https://nrg91.gr/nrg-airplay-chart/') );
}
else
{
$html = file_get_contents( $temp );
}
$dom = new DOMDocument();
$dom->loadHTML( $html );
$xml = simplexml_import_dom( $dom );
$array = array();
foreach( $xml->xpath('//ul[@id="chart_ul"]/li') as $index => $set )
{
$basexpath = dom_import_simplexml( $set )->getNodePath();
$array[] = array(
'ranking' => (string) $xml->xpath( $basexpath . '//span[@id="ranking"]' )[0],
'artist' => (string) $xml->xpath( $basexpath . '//p[@id="artist"]/b' )[0],
'title' => (string) $xml->xpath( $basexpath . '//p[@id="title"]' )[0],
'youtube' => (string) $xml->xpath( $basexpath . '//div[@id="media"]/a/@href' )[0],
);
}
print_r( $array );