I am working with call centre transcripts. In the ideal case the speech-to-text software transcribes an e-mail address as follows: [email protected]. This is not always the case, so I am looking for a regular expression (RegEx) solution that accommodates whitespace in the e-mail address, e.g. maya [email protected] or maya.lucco @proton.me or maya-lucco@pro ton.me
I have tried, unsuccessfully, to extend this solution with regex101. Compiling a re object (pattern) as suggested in this solution seems overly complex for the task. I looked at the posts on validating e-mail addresses, but they describe a different issue. Below is my code so far:
import re
import pandas as pd

# creating some data
test = ['some random text maya @ proton.me with some more text [email protected]',
        '[email protected] with another address [email protected]',
        'some text maya.lucco @proton.me with some more bla [email protected]',
        '[email protected] more text maya@ proton.me '
        ]
test = pd.DataFrame(test, columns=['words'])

# creating a function because I would like to add some other data cleaning to it later on
def anonymiseEmail(text):
    text = str(text)  # make sure text is a string
    text = text.strip()  # remove any leading and trailing whitespace
    text = re.sub(r'\S*@\S*\s?', '{e-mail}', text)  # replace e-mail address with a placeholder
    return text

# applying the function
test['noEmail'] = test.words.apply(anonymiseEmail)

# checking the results
print(test.noEmail[0])
Output: some random text maya {e-mail}proton.me with some more text {e-mail}
The first e-mail address is not fully replaced: the first name Maya remains, which is a problem for the project.
How can the code be extended so that the whole e-mail address, regardless of how much whitespace it contains, is replaced with a placeholder or deleted?
Update following the comments:
I have looked into RegEx lookahead and lookbehind, i.e. (?=@) and (?<=@), but can't seem to make them match the word(s) that precede or follow the @-sign. I am looking at a code snippet Wiktor Stribiżew kindly provided on another occasion, \b(?:Dear|H(?:ello|i))(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+ and thought I could update it to (?i)\b(?<=@)(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+ but it doesn't match any e-mail address according to regex101. Maybe the code snippet (?i)\b(?<=@) (or any other RegEx) can be modified to match the preceding and/or following word(s)?
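A small sketch of why the lookbehind attempt finds nothing, using a made-up address at example.com (an assumption, not data from the transcripts). Lookarounds are zero-width, so (?<=@) only asserts a position and consumes no text. In the modified pattern, \b needs a word character directly after the @-sign, while the next token [^\S\r\n]+ needs whitespace there, so the two conditions can never hold together. A pattern that consumes the words on both sides is shown for contrast; it is only an illustration, not a full e-mail matcher.

```python
import re

# Made-up sample texts (assumption): one with spaces around @, one without.
spaced = "some text maya @ example.com more text"
clean = "some text maya@example.com more text"

# A bare lookbehind matches an empty string at the position right after "@".
empty = re.search(r"(?<=@)", spaced).group()

# The modified pattern from the question: \b and [^\S\r\n]+ contradict each
# other immediately after "@", so it matches neither sample.
pattern = r"(?i)\b(?<=@)(?:[^\S\r\n]+[A-Z]\w*(?:[’'-]\w+)*\.?)+"
no_match_spaced = re.search(pattern, spaced)
no_match_clean = re.search(pattern, clean)

# Consuming the neighbouring words instead of only looking around "@".
consumed = re.search(r"\w+\s*@\s*\w+", spaced).group()

print(repr(empty), no_match_spaced, no_match_clean, consumed)
```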
Another possible solution that comes to my mind is to select the 5 words before and after the @-sign, put them in a separate variable, and check whether there is any whitespace within 4 characters before or after the @-sign. If yes, put them in a queue for a manual check. What concerns me with this solution is a) computing power, b) technical implementation and c) general feasibility. But I thought I'd share it in the spirit of trying to find a solution.
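The manual-check idea could be sketched roughly as follows. The function name needs_manual_check, the window parameter and the example.com sample lines are all hypothetical; the check simply looks for whitespace within a few characters on either side of each @-sign. It is a heuristic and will also flag legitimate addresses with a very short domain followed by a space, so it is meant for building a review queue, not for automatic replacement.

```python
import re

def needs_manual_check(text, window=4):
    """Return True if whitespace occurs within `window` characters
    before or after any @-sign in `text` (hypothetical helper)."""
    for match in re.finditer("@", text):
        before = text[max(0, match.start() - window):match.start()]
        after = text[match.end():match.end() + window]
        if re.search(r"\s", before) or re.search(r"\s", after):
            return True
    return False

# Made-up sample lines (assumption), not real transcript data.
lines = [
    "contact maya @ example.com today",
    "contact maya.lucco@example.com today",
    "no address here",
]

# Only lines with suspicious whitespace near "@" go into the review queue.
manual_queue = [line for line in lines if needs_manual_check(line)]
print(manual_queue)
```

Each line is scanned once, so computing cost should stay modest even for large transcript collections.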
Here's a tweaked regular expression for recognising e-mail addresses like maya @ proton.me, i.e. with optional whitespace on either side of the @-sign.
import re
import pandas as pd

def anonymiseEmail(text):
    # \s* on both sides of @ tolerates whitespace around the @-sign
    email_regex = r"\b[a-zA-Z0-9._%+-]+\s*@\s*[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
    return re.sub(email_regex, "{e-mail}", str(text).strip())

lines = [
    "some random text maya @ proton.me with some more text [email protected]",
    "[email protected] with another address [email protected]",
    "some text maya.lucco @proton.me with some more bla [email protected]",
    "[email protected] more text maya@ proton.me "
]
sample = pd.DataFrame(columns=["Lines"], data=lines)
sample["NoEmail"] = sample.Lines.apply(anonymiseEmail)
print(sample.NoEmail)
Output:
0 some random text {e-mail} with some more text ...
1 {e-mail} with another address {e-mail}
2 some text {e-mail} with some more bla {e-mail}
3 {e-mail} more text {e-mail}
Name: NoEmail, dtype: object
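As a quick check of how far this pattern reaches, here is a minimal sketch that runs it against the three kinds of whitespace mentioned in the question. The addresses at example.com are made up for illustration. Whitespace next to the @-sign is handled; a space inside the local part leaves the first word behind, and a space inside the domain prevents a match altogether, so such cases would still need separate handling or a manual check.

```python
import re

# Same pattern as in the answer above.
email_regex = r"\b[a-zA-Z0-9._%+-]+\s*@\s*[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"

# Made-up addresses (assumption) covering the three cases from the question.
samples = {
    "space before @": "maya.lucco @example.com",
    "space inside local part": "maya lucco@example.com",
    "space inside domain": "maya-lucco@exam ple.com",
}

results = {
    label: re.sub(email_regex, "{e-mail}", text)
    for label, text in samples.items()
}

for label, replaced in results.items():
    print(f"{label}: {replaced}")
```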