fff.html is an email with email addresses in it some have href mailto links and some don't, i want to scrape them and output them into the following format
[email protected],[email protected],[email protected]
I have a simple scraper to get the ones that are href linked but something is wierd
<?php
$url = "fff.html";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<a href="mailto:');
$end = strpos($content,'"',$start) + 8;
$mail = substr($content,$start,$end-$start);
print "$mail<br />";
?>
I should get extra points for the original use of lorem ipsum
The problem is what if you have more than one email address in the HTML page. substr will only return the first instance. Here is a script that will parse all email addresses. You may need to tweak it some for your use. It will output the results in the CSV form you requested.
<?php
$url = "fff.html";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content, '<body>');
$end = strpos($content, '</body>');
$data = substr($content, $start, $end-$start);
$pattern = '#a[^>]+href="mailto:([^"]+)"[^>]*?>#is';
preg_match_all($pattern, $data, $matches);
foreach ($matches[1] as $key => $email) {
$emails[] = $email;
}
echo implode(', ', $emails );
?>