This is a repost of a similar question I asked a few weeks back. I think I am able to do this, but at significant (read: untenable) cost to performance. A heroic poster there helped me somewhat, and I am reposting at their direction. My code:
import pandas as pd

Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                     usecols=["Email Address"], na_filter=True)
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"], na_filter=True)
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                                  usecols=["Email"], na_filter=True)
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Student Ambassadors Q1 2021.csv",
                          usecols=["Email Address"], na_filter=True)
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                         usecols=["Email Address"], na_filter=True)
# Advocates is loaded the same way (read_csv call omitted from this post)

# Tag each frame with its source group
Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Advocates['Value'] = "Followers"

S1 = pd.concat([Donors, Activists, Low_Level_Activists, Ambassadors, Volunteers, Advocates],
               ignore_index=True)
# Coalesce the two differently named email columns into a single 'Handle'
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1 = S1.drop(['Email', 'Email Address'], axis=1)
print(S1['Handle'].count())  # checks full count
Now here is the difficult part: an email address can be a Donor and an Activist, an Activist and a Volunteer, all five Values, and so on, but mostly it's one email, one Value. I need to get a DataFrame (or list, or dictionary) that captures each unique email address under its earliest import class. If an email is a Donor and a Volunteer, then it's a Donor; if it's an Activist and a Low Level Activist, then an Activist; and so on.
I am looking for output that has each Value group along with the unique count of emails. This code:
pd.Series({value: len(group.Handle.unique()) for value, group in S1.groupby('Value')})
does the trick (h/t Jan W.), but it does not condense the email counts based on the earliest "Value." Help would be greatly appreciated, as I have been at this for many days. There are 1,700 instances in this set where an email exists in two or more places, among 167,000 email addresses. I can't seem to get around using a for loop and then switching back to a pandas Series, etc. Efficient code to figure this out would again be greatly appreciated!
Sample data (again h/t Jan W):
fa = pd.DataFrame([['[email protected]', 'Donors'], ['[email protected]', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['[email protected]', 'Activists'], ['[email protected]', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])
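For that sample, the output I am after would count [email protected] only once, under Donors (its earliest class):

Donors       2
Activists    1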
Combining the pandas groupby and sort_values methods will let you run everything quite fast. The snippet below is an example implementation.
First, generate ~100K rows of random data (this part is a bit contrived, but that is not relevant to your actual problem):
from itertools import combinations
import string
import numpy as np
import pandas as pd

# All 4-letter combinations of lowercase characters on two domains,
# i.e. ~30K distinct addresses
emails = [
    f"{''.join(chars)}@{domain}"
    for chars in combinations(string.ascii_lowercase, 4)
    for domain in ["example.com", "domain.org"]
]
# "Ambassadors" is repeated on purpose, to skew the random role distribution
roles = [
    "Donors",
    "Activists",
    "Low_Level_Activists",
    "Ambassadors",
    "Ambassadors",
    "Ambassadors",
]
n = 100_000
df = pd.DataFrame(
    {
        "Handle": np.random.choice(emails, size=n),
        "Value": np.random.choice(roles, size=n),
    }
).drop_duplicates()
Now df looks like:
Handle Value
0 [email protected] Ambassadors
1 [email protected] Low_Level_Activists
2 [email protected] Low_Level_Activists
3 [email protected] Ambassadors
4 [email protected] Low_Level_Activists
... ... ...
99993 [email protected] Activists
99994 [email protected] Ambassadors
99995 [email protected] Ambassadors
99997 [email protected] Ambassadors
99999 [email protected] Donors
[62609 rows x 2 columns]
Then fetch for each email the first role it belongs to:
df = df.sort_values("Value").groupby("Handle", as_index=False)["Value"].first()
Handle Value
0 [email protected] Activists
1 [email protected] Donors
2 [email protected] Ambassadors
3 [email protected] Activists
4 [email protected] Ambassadors
... ... ...
28848 [email protected] Activists
28849 [email protected] Ambassadors
28850 [email protected] Ambassadors
28851 [email protected] Ambassadors
28852 [email protected] Activists
[28853 rows x 2 columns]
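One caveat: sort_values("Value") orders the roles alphabetically, which is not quite your "earliest import" priority (alphabetically, Activists lands before Donors). Here is a sketch of the same step using an ordered categorical, so that first() picks the highest-priority role; the priority list is assumed from the import order in your post:

# Priority assumed from the question's import order: earliest import wins
priority = ["Donors", "Activists", "Low_Level_Activists",
            "Ambassadors", "Volunteers", "Followers"]
df["Value"] = pd.Categorical(df["Value"], categories=priority, ordered=True)
# sort_values now orders by category rank, so first() keeps the earliest class
df = df.sort_values("Value").groupby("Handle", as_index=False)["Value"].first()

(The outputs shown in this answer come from the alphabetical version.)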
And finally count the number of emails per role:
df = df.groupby("Value")["Handle"].nunique()
Value
Activists 12765
Ambassadors 13987
Donors 1374
Low_Level_Activists 727
Name: Handle, dtype: int64
Once df is defined, the two calculations above run in only ~100 ms on my laptop.
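As an aside, starting again from the original df, the two steps can be fused into a single chain; a sketch:

# Sort by priority, keep each email's first (highest-priority) row,
# then count how many emails fall in each role
counts = (
    df.sort_values("Value")
      .drop_duplicates("Handle")
      ["Value"]
      .value_counts()
)

drop_duplicates("Handle") keeps the first occurrence of each email after the sort, so value_counts() then gives the same per-role totals.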