I'm trying to scrape a site's link texts, i.e. the "SCRAPE THIS" part of a link. I want to do this for all links on the page. So far I have this:
<?php
$target_url = "SITE I WANT TO SCRAPE";
// make the cURL request to $target_url
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$target_url);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}
// parse the html into a DOMDocument
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the link texts on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/text()");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    echo "<br />Link stored: $href";
}
?>
I'm pretty new to this stuff and can't figure out what I'm doing wrong.
Thanks!
In your for loop, $href is not a string. It's actually a DOMText node. In order to use it as a string, you need to access its nodeValue property.
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    echo "<br />Link stored: $href->nodeValue";
}
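If it helps, here is a minimal self-contained sketch of the same idea that you can run without cURL. The sample markup in $html and the $linkTexts array are made up for illustration; they are not from your page. It selects the a elements themselves and reads textContent, which should also pick up text inside nested tags such as <b>, whereas a/text() only returns text nodes that are direct children of the link.

```php
<?php
// Hypothetical sample markup standing in for the page fetched with cURL.
$html = '<html><body><p><a href="/a">First link</a> and '
      . '<a href="/b">Second <b>link</b></a></p></body></html>';

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// DOMNodeList is traversable, so foreach works in place of the index loop.
$linkTexts = [];
foreach ($xpath->query('//a') as $anchor) {
    // textContent is a plain string: the combined text of the element
    // and all of its descendants.
    $linkTexts[] = trim($anchor->textContent);
}

foreach ($linkTexts as $text) {
    // Escape the text before echoing it back into HTML output.
    echo "<br />Link stored: " . htmlspecialchars($text);
}
```

With the sample markup above, $linkTexts should end up holding "First link" and "Second link". If you only want the direct text nodes, keep your original a/text() query and read nodeValue as shown in the loop above.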