python python-3.x regex email python-re

Remove e-mail addresses that contain whitespace


I am working with call centre transcripts. Ideally, the speech-to-text software transcribes an e-mail address as follows: [email protected]. This is not always the case, so I am looking for a regular expression (RegEx) solution that tolerates whitespace inside the e-mail address, e.g. maya [email protected] or maya.lucco @proton.me or maya-lucco@pro ton.me.

I tried, without success, to extend this solution using regex101. Compiling a re object (pattern) as suggested in this solution seems overly complex for the task. I looked at the posts on validating e-mail addresses, but they describe a different issue. Below is my code so far:

import re
import pandas as pd

#creating some data 
test = ['some random text maya @ proton.me with some more text [email protected]',
        '[email protected] with another address [email protected]',
        'some text maya.lucco @proton.me with some more bla [email protected]',
        '[email protected] more text maya@ proton.me '
        ]
        
test = pd.DataFrame(test, columns = ['words'])

#creating a function because I like to add some other data cleaning to it later on
def anonymiseEmail(text):
    
    text = str(text) #make text as string variable
    text = text.strip() #remove any leading, and trailing whitespaces
    text = re.sub(r'\S*@\S*\s?', '{e-mail}', text) #replace e-mail address with a placeholder
    
    return text

# applying the function
test['noEmail'] = test.words.apply(anonymiseEmail)

#checking the results
print(test.noEmail[0])
Output: some random text maya {e-mail}proton.me with some more text {e-mail}

The first e-mail address is not fully replaced: the first name maya remains. This is a problem for the project.
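
A minimal sketch of why this happens, assuming the same pattern and a shortened version of the first sample line: \S* cannot cross whitespace, so when a space sits on both sides of the @-sign the pattern only consumes the @ and the space after it. The variable names are illustrative.

```python
import re

# The pattern from anonymiseEmail above, unchanged.
pattern = r'\S*@\S*\s?'
text = 'some random text maya @ proton.me with some more text'

# List every substring the pattern matches in the sample line.
matches = re.findall(pattern, text)
print(matches)

# Substitute the placeholder exactly as anonymiseEmail does.
replaced = re.sub(pattern, '{e-mail}', text)
print(replaced)
```

Only the "@ " fragment is matched, which is why maya and proton.me are left on either side of the placeholder.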

How can the code be extended so that the whole e-mail address, regardless of how much whitespace it contains, is replaced with a placeholder or deleted?

Update following the comments:

I have looked into RegEx lookahead and lookbehind, i.e. (?=@) and (?<=@), but can't seem to make them match the word(s) that precede or follow the @-sign. I am looking at a code snippet Wiktor Stribiżew kindly provided on another occasion, \b(?:Dear|H(?:ello|i))(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+''', '', text, and thought I could adapt it to (?i)\b(?<=@)(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+ but it doesn't match any e-mail address according to regex101. Maybe the code snippet (?i)\b(?<=@) (or any other RegEx) can be modified to match the preceding and/or following word(s)?
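
One possible reason the adapted pattern never matches, shown as a small sketch: \b needs a word character on exactly one side of the position, but the lookbehind fixes a non-word character (@) on the left and the pattern then demands whitespace on the right, so no position can satisfy both. The second pattern below is only an illustrative variation with \b removed, not a fix; it still requires whitespace directly after the @-sign and ignores everything before it.

```python
import re

text = "maya @ proton.me"

# Pattern from the update above: \b directly before the lookbehind.
with_boundary = r"(?i)\b(?<=@)(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+"
# Same pattern with only the \b removed (illustrative variation).
without_boundary = r"(?i)(?<=@)(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+"

first = re.search(with_boundary, text)
second = re.search(without_boundary, text)

print(first)                                # no match object is returned
print(second.group() if second else None)   # the text right of the @-sign, up to the dot
```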

Another possible solution that comes to mind is to select the 5 words before and after the @-sign, put them in a separate variable, and check whether any whitespace occurs within 4 characters before or after the @-sign. If so, put the line in a queue for a manual check. What concerns me with this solution is a) computing power, b) technical implementation and c) general feasibility. But I thought I'd share it in the spirit of trying to find a solution.
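
A rough sketch of the character-window part of that idea, under stated assumptions: the function name needs_manual_check and the window parameter are hypothetical, and the last sample string is made up to show a clean address. Each line is scanned once per @-sign, so the cost should stay modest, though this has not been measured on real transcripts.

```python
import re

def needs_manual_check(text, window=4):
    """Return True if whitespace occurs within `window` characters of any @-sign."""
    for match in re.finditer("@", text):
        # Slice the characters immediately before and after this @-sign.
        before = text[max(0, match.start() - window):match.start()]
        after = text[match.end():match.end() + window]
        if re.search(r"\s", before) or re.search(r"\s", after):
            return True
    return False

samples = [
    "some text maya.lucco @proton.me with some more bla",
    "more text maya-lucco@pro ton.me",
    "an address without gaps: maya.lucco@proton.me",  # made-up clean sample
]
flags = [needs_manual_check(s) for s in samples]
print(flags)
```

Lines flagged True could be routed to the manual queue, while the rest go through the automatic replacement.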


Solution

  • Here's a tweaked regular expression for recognising e-mail addresses such as maya @ proton.me, where whitespace surrounds the @-sign.

    import re
    import pandas as pd
    
    def anonymiseEmail(text):
        email_regex = r"\b[a-zA-Z0-9._%+-]+\s*@\s*[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
        return re.sub(email_regex, "{e-mail}", str(text).strip())
    
    
    lines = [
        "some random text maya @ proton.me with some more text [email protected]",
        "[email protected] with another address [email protected]",
        "some text maya.lucco @proton.me with some more bla [email protected]",
        "[email protected] more text maya@ proton.me "
        ]
    
    sample = pd.DataFrame(columns=["Lines"], data=lines)
    sample["NoEmail"] = sample.Lines.apply(anonymiseEmail)
    
    print(sample.NoEmail)
    

    Output:

    0    some random text {e-mail} with some more text ...
    1               {e-mail} with another address {e-mail}
    2       some text {e-mail} with some more bla {e-mail}
    3                          {e-mail} more text {e-mail}
    Name: NoEmail, dtype: object
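
The \s* on either side of @ is what absorbs the stray spaces. As a hedged check of where this pattern stops helping, the sketch below runs it without pandas on three short strings; the third string is a made-up sample, and the variable names are illustrative.

```python
import re

email_regex = r"\b[a-zA-Z0-9._%+-]+\s*@\s*[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"

cases = [
    "text maya @ proton.me more",       # whitespace on both sides of @
    "text maya-lucco@pro ton.me more",  # whitespace inside the domain
    "text maya lucco@proton.me more",   # whitespace inside the local part (made-up sample)
]

# Apply the same substitution as anonymiseEmail to each case.
results = [re.sub(email_regex, "{e-mail}", c) for c in cases]
for r in results:
    print(r)
```

The first case is replaced in full. The second is left untouched because the domain must reach its dot without a gap, and in the third only lucco@proton.me is replaced, so maya remains. Whitespace inside the local part or the domain would therefore still need separate handling.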