php substr strpos scraper

scrape email addresses


fff.html is an email with email addresses in it. Some are href mailto links and some aren't. I want to scrape them and output them in the following format:

[email protected],[email protected],[email protected]

I have a simple scraper to get the ones that are href linked, but something is weird:

  <?php
    $url = "fff.html";
    $raw = file_get_contents($url);

    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));

    $start = strpos($content,'<a href="mailto:');
    $end = strpos($content,'"',$start) + 8;
    $mail = substr($content,$start,$end-$start);

    print "$mail<br />";
    ?>

I should get extra points for the original use of lorem ipsum


Solution

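  • The output looks odd because strpos($content,'"',$start) finds the quote that opens the href value, not the one that closes it, so substr returns only the <a href="mailto: prefix. Even with that fixed, strpos and substr return only the first match, so any further addresses on the page are missed. Here is a script that uses preg_match_all to collect every address found in a mailto link and prints them as a comma-separated list, as you requested. It does not pick up addresses that are not linked, so you may need to tweak it for your use.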

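    <?php
    $url = "fff.html";
    $raw = file_get_contents($url);
    
    // Strip tabs, line breaks, double spaces, NUL bytes and vertical tabs.
    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));
    
    // Limit the search to the body when both tags are found;
    // otherwise fall back to the whole document.
    $start = strpos($content, '<body');
    $end = strpos($content, '</body>');
    $data = ($start !== false && $end !== false)
        ? substr($content, $start, $end-$start)
        : $content;
    
    // Capture the address from every <a ... href="mailto:..."> tag.
    $pattern = '#<a\s[^>]*href="mailto:([^"]+)"[^>]*>#is';
    preg_match_all($pattern, $data, $matches);
    
    // $matches[1] holds the captured addresses (an empty array if none were
    // found); join them with commas and no spaces, as in the requested format.
    echo implode(',', $matches[1]);
    ?>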
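
The question also asks for addresses that are not wrapped in a mailto link. One possible way to cover both cases is to skip the link markup and match anything that looks like an address anywhere in the decoded HTML. This is only a sketch: the function name extract_emails is made up for illustration, the inline sample HTML stands in for fff.html, and the address pattern is a deliberately loose assumption that is much simpler than the full email address grammar.

```php
<?php
// Hypothetical helper, not part of the original answer: collects addresses
// from mailto: links and from plain text in a single pass.
function extract_emails(string $html): array
{
    // Decode entities so addresses written with e.g. &#64; become plain text.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Simplified address pattern (an assumption, looser than the real grammar).
    // The ":" in "mailto:" is not part of the character class, so a linked
    // address is matched without its "mailto:" prefix.
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/';
    preg_match_all($pattern, $text, $matches);

    // A linked address often appears twice (in the href and in the link text),
    // so lowercase everything and drop duplicates, keeping first-seen order.
    return array_values(array_unique(array_map('strtolower', $matches[0])));
}

// Sample input standing in for file_get_contents("fff.html").
$html = '<p><a href="mailto:alice@example.com">alice@example.com</a>, bob@example.org</p>';
echo implode(',', extract_emails($html)), "\n";
```

Matching on the whole document trades precision for coverage: it will also pick up anything address-shaped inside attributes or scripts, so the pattern may need tightening for a real page.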