How can I catch the following obfuscated email addresses in PHP?


Consider the following script, which contains obfuscated email addresses and a function that attempts to replace them with *** using regex pattern matching. The pattern is meant to catch "at", "a t", "a.t", or "@", followed by some text (any domain name), followed by "dot", ".", or "d.o.t", followed by a TLD.

Input:

$str[] = 'dsfatasdfasdf asd dsfasdf dsfdsf@hotmail.com';
$str[] = 'I live at school where My address is dsfdsf@hotmail.com';
$str[] = 'I live at school. My address is dsfdsf@hotmail.com';
$str[] = 'at school my address is dsfdsf@hotmail.com';
$str[] = 'dsf a t asdfasdf asd dsfasdf dsfdsf@hotmail.com';
$str[] = 'd s f d s f a t h o t m a i l . c o m';

function clean_text($text){
    $pattern = '/(\ba[ \.\-_]*t\b|@)[ \.\-_]*(.+)[ \.\-_]*(d[ \.\-_]*o[ \.\-_]*t|\.)[ \.\-_]*(c[ \.\-_]*o[ \.\-_]*m|n[ \.\-_]*e[ \.\-_]*t|o[ \.\-_]*r[ \.\-_]*g|([a-z][ \.\-_]*){2,3}[a-z]?)/iU'; 
    return preg_replace($pattern, '***', $text); 
}

foreach($str as $email){
    // Print each cleaned string on its own line.
    echo clean_text($email) . "\n";
}

Expected Output:

dsfatasdfasdf asd dsfasdf dsfdsf*** 
I live at school where My address is dsfdsf@***
I live at school. My address is dsfdsf@***
at school my address is dsfdsf***
dsf ***
d s f d s f ***

Result:

dsfatasdfasdf asd dsfasdf dsfdsf***
I live ***
I live ***
***
dsf *** 
d s f d s f *** 

Problem: The pattern matches the first occurrence of "at" rather than the last, so the following happens:

input: 'at school my address is dsfdsf@hotmail.com'
produces: '***'
should produce: 'at school my address is dsfdsf***'

How can I fix this?


Solution

  • Based on M42's regex:

    Code:

    $emails = array(
                    'dsfatasdfasdf asd dsfasdf dsfdsf@hotmail.com'
                    ,'I live at school where My address is dsfdsf@hotmail.com'
                    ,'I live at school. My address is dsfdsf@hotmail.com'
                    ,'at school my address is dsfdsf@hotmail.com'
                    ,'dsf a t asdfasdf asd dsfasdf dsfdsf@hotmail.com'
                    ,'d s f d s f a t h o t m a i l . c o m'
                    );
    
    foreach($emails as $email)
    {
        $found = preg_match('/(.*?)((\@|a[_. -]*t)[\w .-]*?$)/', $email, $matches);
        if($found)
        {
            echo 'Username: ' . $matches[1] . ', Domain: ' . $matches[2] . "\n";
        }
    }
    

    Output:

    Username: dsfatasdfasdf asd dsfasdf dsfdsf, Domain: @hotmail.com
    Username: I live at school where My address is dsfdsf, Domain: @hotmail.com
    Username: I live at school. My address is dsfdsf, Domain: @hotmail.com
    Username: at school my address is dsfdsf, Domain: @hotmail.com
    Username: dsf a t asdfasdf asd dsfasdf dsfdsf, Domain: @hotmail.com
    Username: d s f d s f , Domain: a t h o t m a i l . c o m
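
The code above only splits each string; the question asks for the address part to be replaced. A minimal sketch of how the same regex could drive a replacement follows. The function name mask_domain() is hypothetical and not part of the original answer, and the pattern is anchored with ^ and uses a non-capturing group, which should not change what it matches here.

```php
<?php
// Hypothetical helper (not from the original answer): masks everything from
// the matched "@"/"at" token to the end of the string.
function mask_domain(string $text): string
{
    // Group 1 (lazy): the text to keep.
    // Group 2: "@" or an obfuscated "at" (a, optional separators, t), followed
    // only by word characters, spaces, dots, and hyphens up to the end.
    $pattern = '/^(.*?)((?:@|a[_. -]*t)[\w .-]*)$/';

    // Keep group 1 and replace group 2 with asterisks.
    return preg_replace($pattern, '${1}***', $text);
}

$samples = array(
    'at school my address is dsfdsf@hotmail.com',
    'd s f d s f a t h o t m a i l . c o m',
);

foreach ($samples as $sample) {
    echo mask_domain($sample) . "\n";
}
// at school my address is dsfdsf***
// d s f d s f ***
```

Two caveats with this sketch. Because the pattern only requires an "at"-like token followed by word characters to the end of the string, plain text such as 'I live at school' would also be masked to 'I live ***'. And for 'dsf a t asdfasdf asd dsfasdf dsfdsf@hotmail.com' it keeps everything up to the "@", which differs from the 'dsf ***' the question lists as expected for that input.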