I am attempting to scrape a website using the DOMXPath query method. I have successfully scraped the 20 profile URLs of each News Anchor from this page.
$url = "http://www.sandiego6.com/about-us/meet-our-team";
$xPath = "//p[@class='bio']/a/@href";
$html = new DOMDocument();
@$html->loadHtmlFile($url);
$xpath = new DOMXPath( $html );
$nodelist = $xpath->query($xPath);
$profileurl = array();
foreach ($nodelist as $n){
$value = $n->nodeValue;
$profileurl[] = $value;
}
I used the resulting array as the URL to scrape data from each of the News Anchor's bio pages.
$imgurl = array();
for($z=0;$z<$elementCount;$z++){
$html = new DOMDocument();
@$html->loadHtmlFile($profileurl[$z]);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//img[@class='photo fn']/@src");
foreach($nodelist as $n){
$value = $n->nodeValue;
$imgurl[] = $value;
}
}
Each News Anchor profile page has 6 xPaths I need to scrape (the $imgurl array is one of them). I am then sending this scraped data to MySQL.
So far, everything works great - except when I attempt to get the Twitter URL from each profile because this element isn't found on every News Anchor profile page. This results in MySQL receiving 5 columns with 20 full rows and 1 column (twitterurl) with 18 rows of data. Those 18 rows are not lined up with the other data correctly because if the xPath doesn't exist, it seems to be skipped.
How do I account for missing xPaths? Looking for an answer, I found someone's statement that said, "The nodeValue can never be null because without a value, the node wouldn't exist." That being considered, if there is no nodeValue, how can I programmatically recognize when these xPaths don't exist and fill that iteration with some other default value before it loops through to the next iteration?
Here's the query for the Twitter URLs:
$twitterurl = array();
for($z=0;$z<$elementCount;$z++){
$html = new DOMDocument();
@$html->loadHtmlFile($profileurl[$z]);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query("//*[@id='bio']/div[2]/p[3]/a/@href");
foreach($nodelist as $n){
$value = $n->nodeValue;
$twitterurl[] = $value;
}
}
Since the twitter node appears zero or one times, change the foreach to
$twitterurl [] = $nodelist->length ? $nodelist->item(0)->nodeValue : NULL;
That will keep the contents in sync. You will, however, have to make arrangements to handle NULL values in the query you use to insert them in the database.