Hi Im cleaning bunch of emails and need help of removing data on each cell.In my data frame it looks like this
I want to remove the duplicated values on each value on each cell with multiple row duplicates too.
My data
left | center |
---|---|
One | email1emailcom email2emailcom email3emailcom |
Two | email1emailcom email15emailcom email12emailcom |
Three | email15emailcom email16emailcom |
Output
left | center |
---|---|
One | email1emailcom email2emailcom email3emailcom |
Two | email15emailcom email12emailcom |
Three | email16emailcom |
If you want to keep a running track of emails you have already seen, then you likely want to leverage some global set()
construct, say an actual global or a class variable or a closure. I will use the latter here:
import pandas
df = pandas.DataFrame({
"left": [
"One",
"Two",
"Three"
],
"center": [
"email1emailcom email2emailcom email3emailcom",
"email1emailcom email15emailcom email12emailcom",
"email15emailcom email16emailcom"
]
})
## ---------------------
## Construct a closure that will use a set() to keep track of distinct emails
## and return work function that will filter out duplicates using the set
## from the emails string passed to it
## ---------------------
def make_email_filter():
distinct_emails = set() ## keep track of distinct emails
## ---------------------
## Work function that will filter out duplicates including in-row
## and cross-row duplicates
## ---------------------
def _work(emails_string):
these_emails = []
for email in emails_string.split():
if email in distinct_emails:
continue
distinct_emails.add(email)
these_emails.append(email)
return " ".join(these_emails)
## ---------------------
## ---------------------
## Return our work function back to the caller
## ---------------------
return _work
## ---------------------
## ---------------------
## ---------------------
## Update our column based on application of the work function
## ---------------------
df["center"] = df["center"].apply(make_email_filter())
## ---------------------
print(df)
That will give us a result like:
left center
0 One email1emailcom email2emailcom email3emailcom
1 Two email15emailcom email12emailcom
2 Three email16emailcom
If you want to ignore both emails you have already seen AND emails from a bad domain, you can just expand your test. Here I will do two tests just to be explicit, but in practice, I would combine these into one test.
import pandas
df = pandas.DataFrame({
"left": [
"One",
"Two",
"Three"
],
"center": [
"email1emailcom email2emailcom email3emailcom",
"email1emailcom email15emailcom email12emailcom",
"email15emailcom email16emailcom foo@amazon.com"
]
})
## ---------------------
## Construct a closure that will use a set to keep track of distinct emails
## and return work function that will filter out duplicates
## from the emails string passed to it.
## Note the addition of an init parameter to allow for filtering out
## emails from a specific domain.
## ---------------------
def make_email_filter(discard_domain=""):
distinct_emails = set() ## keep track of distinct emails
## ---------------------
## Work function that will filter out duplicates including in-row
## and cross-row duplicates
## ---------------------
def _work(emails_string):
these_emails = []
for email in emails_string.split():
## ---------------------
## Ignore this email if it is from a bad domain
## ---------------------
if discard_domain and email.endswith(discard_domain):
continue
## ---------------------
## ---------------------
## Ignore this email if we have already seen it
## ---------------------
if email in distinct_emails:
continue
## ---------------------
distinct_emails.add(email)
these_emails.append(email)
return " ".join(these_emails)
## ---------------------
## ---------------------
## Return our work function back to the caller
## ---------------------
return _work
## ---------------------
## ---------------------
## ---------------------
## Update our column based on application of the work function
## ---------------------
df["center"] = df["center"].apply(make_email_filter(discard_domain="@amazon.com"))
## ---------------------
print(df)
Again giving us:
0 One email1emailcom email2emailcom email3emailcom
1 Two email15emailcom email12emailcom
2 Three email16emailcom