I am trying to scrape all the URLs on the home page of my client's site so I can migrate it to WordPress. The problem is I can't seem to arrive at a de-duplicated list of URLs.
Here's the code:
$html = file_get_contents('http://www.catwalkyourself.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    if ($url = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])) {
        $urls = $matches[0][0][0];
        $list = implode(', ', array_unique(explode(", ", $urls)));
        echo $list . '<br/>';
        //print_r($list);
    }
}
Instead I am getting duplicates like this:
http://www.catwalkyourself.com/rss.php
http://www.catwalkyourself.com/rss.php
How do I fix this?
The last part of your code shouldn't be inside the loop. You're iterating over a list containing every link on the page, and each iteration handles only one link, so you're applying array_unique to an array that can never contain more than one element.
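To see why, consider what happens on each pass of your loop: you build a fresh one-element array and then de-duplicate it, which changes nothing. A quick illustration (the sample URL is just one of the duplicates from your output):

// array_unique has nothing to remove from a single-element array,
// so the same URL gets echoed on every iteration
$urls = array('http://www.catwalkyourself.com/rss.php');
print_r(array_unique($urls)); // still exactly one element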
Try something like this:
$html = file_get_contents('http://www.catwalkyourself.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the links on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$urls = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    if ($url = preg_match_all('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $matches[0])) {
        $urls[] = $matches[0][0][0];
    }
}
$list = implode(', ', array_unique($urls));
echo $list . '<br/>';
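If you'd rather de-duplicate as you collect, a variation is to use the matched URL as an array key, since PHP array keys are unique by definition. This is just a sketch along the same lines, reusing your regex unchanged and not tested against the live site:

$html = file_get_contents('http://www.catwalkyourself.com');
$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

$urls = array();
for ($i = 0; $i < $hrefs->length; $i++) {
    $url = $hrefs->item($i)->getAttribute('href');
    // only keep links pointing at catwalkyourself.com
    if (preg_match('((www|http://)(www)?.catwalkyourself.com\/?.*)', $url, $match)) {
        $urls[$match[0]] = true; // duplicate URLs simply overwrite the same key
    }
}

echo implode(', ', array_keys($urls)) . '<br/>';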