php filtering information-hiding

Automatically removing contact information from documents


Does anybody know of a good solution, usable from PHP, that will effectively remove contact information such as phone numbers, email addresses, and maybe even postal addresses from a document?

Update

Hey guys, here is what I have come up with so far. It works pretty well.

function sanitizeContent($content)
    {
        // emails - even containing white space characters like this 't e s t @ ba d . co m'
        // (character class fixed: the original 'a-x-0-9' left out 'y' and 'z')
        $content = preg_replace('/[A-Za-z0-9\s_.\-]{1,50}@[A-Za-z0-9\s_.\-]{1,50}/', '[email removed]', $content);

        // urls
        $content = preg_replace('/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i', '[link removed]', $content);

        // phone numbers - separators sit in a character class so that '.' is a literal dot
        $content = preg_replace('/\d?[\s.\/-]?\(?\d{3}\)?[\s.\/-]\d{3}[\s.\/-]\d{4}/', '[phone removed]', $content);
        // looser digit runs with an optional extension ('x|ext' cannot be alternated inside [...])
        $content = preg_replace('/\d[\d.\-\s,\/()]{5,48}\d(?:\s*(?:x|ext)\.?\s*\d{1,6})?/i', '[phone removed]', $content);

        // addresses????

        return $content;
    }

Does anybody have any ideas for addresses? I am thinking of detecting a "city, state, ZIP" sequence and then also stripping a fixed number of characters before it. That could accidentally clobber some data, but that might be better than disclosure. I would be really interested to hear if anybody else has run into this.
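To make the "city, state, ZIP" idea concrete, here is a minimal sketch of one way it might look. The function name stripAddresses, the $lookBehind parameter, and the assumption of US-style addresses ending in ", ST 12345" are all mine for illustration and are not part of the function above.

```php
<?php
// Hypothetical helper (not from the original post): removes a US-style
// "City, ST 12345" tail together with up to $lookBehind characters before it.
function stripAddresses($content, $lookBehind = 40)
{
    // Assumed shape: capitalized city name, comma, two-letter state code,
    // then a 5-digit ZIP with an optional "-1234" suffix.
    $cityStateZip = '\b[A-Z][A-Za-z .\'-]{1,30},\s*[A-Z]{2}\.?\s+\d{5}(?:-\d{4})?\b';

    // The leading .{0,N} is the "strip x chars before that" part; the s flag
    // lets it cross line breaks, since addresses often span several lines.
    $pattern = '/.{0,' . (int) $lookBehind . '}' . $cityStateZip . '/s';

    return preg_replace($pattern, '[address removed]', $content);
}

echo stripAddresses('Send mail to 123 Main Street, Springfield, IL 62704 today.');
// prints "[address removed] today."
```

As expected, the fixed window also swallows the harmless words "Send mail to". A smaller $lookBehind clobbers less but may leave part of the street line behind, so the value is a trade-off between data loss and disclosure.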


Solution

  • Use regular expressions.

    You can use preg_replace to do it.

    $pattern = "/[a-zA-Z]*[:\/\/]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i";
    $replacement = "[removed]";
    // preg_replace returns the new string; it does not modify $string in place
    $string = preg_replace($pattern, $replacement, $string);
    

    For emails:

    $pattern = "/[^@\s]*@[^@\s]*\.[^@\s]*/";
    $replacement = "[removed]";
    // assign the result, otherwise the replacement is discarded
    $string = preg_replace($pattern, $replacement, $string);
    

    For URLs: