Tags: php, regex, html-parsing

PHP Get unsubscribe URL from email body


I have an email's HTML body and need to extract just the unsubscribe link from it. If any link in the DOM contains the word "Unsubscribe", I need to return the URL of that specific link. I have tried several regular expressions, but they either fail to isolate the unsubscribe URL or sometimes match nothing at all.

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*(?:unsubscribe).*)<\/a>";
preg_match_all("/$regexp/iU", $body, $matches);
var_dump($matches);

This does not work :/

Thanks


Solution

  • You can use DOMXPath to select the anchors whose text contains a case-insensitive match for "unsubscribe", then read each URL from the href attribute with getAttribute.

    $data = <<<DATA
    This is a link <a href="https://stackoverflow.com/">SO</a> and this is <a href="http://test.test">unsubscribe</a> and 
    another and this is <a href="http://test.test">UnSubScribe</a>.
    DATA;
    
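    $dom = new DOMDocument();
    $dom->loadHTML($data);
    $xpath = new DOMXPath($dom);

    // XPath 1.0 has no lower-case(); translate() maps the uppercase letters
    // of "UNSUBSCRIBE" to lowercase so that contains() matches any casing.
    $query = "//a[contains(translate(., 'UNSUBSCRIBE', 'unsubscribe'), 'unsubscribe')]";
    $anchors = $xpath->query($query);

    // Print the link text and href of every matching anchor.
    foreach ($anchors as $a) {
        printf("%s: %s" . PHP_EOL,
            $a->nodeValue,
            $a->getAttribute("href")
        );
    }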
    

    Output

    unsubscribe: http://test.test
    UnSubScribe: http://test.test
    

    See a PHP demo.
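
If you need the URLs as return values rather than printed output, the same XPath approach can be wrapped in a function. The following is only a minimal sketch: the function name findUnsubscribeUrls, the example.com URLs, and the extra check on the href attribute are assumptions that are not part of the answer above. It also silences libxml warnings, since real email HTML is often malformed.

```php
<?php
// Hypothetical helper (name and behaviour are illustrative assumptions):
// returns the href of every <a> whose text OR href mentions "unsubscribe".
function findUnsubscribeUrls(string $html): array
{
    // loadHTML() rejects empty input, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Lowercase via translate(), as in the answer, but over the full ASCII alphabet.
    $upper = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $lower = 'abcdefghijklmnopqrstuvwxyz';
    $query = "//a[@href][contains(translate(., '$upper', '$lower'), 'unsubscribe')"
           . " or contains(translate(@href, '$upper', '$lower'), 'unsubscribe')]";

    $urls = [];
    foreach ($xpath->query($query) as $a) {
        $urls[] = $a->getAttribute('href');
    }

    // Drop duplicates while keeping document order.
    return array_values(array_unique($urls));
}

// Usage with made-up sample markup:
$body = '<p><a href="https://example.com/">Home</a> '
      . '<a href="https://example.com/unsub?id=1">UnSubscribe</a> '
      . '<a href="https://example.com/unsubscribe/2">click here</a></p>';

print_r(findUnsubscribeUrls($body));
```

Matching on the href as well catches links labelled only "click here", at the cost of possible false positives. Drop the second contains() clause if only the link text should count.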