python · performance · tuples · pandas-groupby · unique

How to get unique emails in a two column list where values can have two (or more!) separate values


This is a repost of a similar question I asked a few weeks back. I think I am able to do this, but at a significant (read: untenable) cost to performance. A heroic poster there helped me somewhat, and I am reposting at their direction. My code:

import pandas as pd

# Load each source file, keeping only its email column
Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                        usecols=["Email Address"], na_filter= True)
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"], na_filter= True)
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                        usecols=["Email"], na_filter= True)
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Student Ambassadors Q1 2021.csv",
                        usecols=["Email Address"], na_filter= True)
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                        usecols=["Email Address"], na_filter= True)
# Advocates is loaded from its own CSV in the same way (call not shown)

Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Advocates['Value'] = 'Followers'

# Stack all sources into one frame, then coalesce the two differently
# named email columns into a single 'Handle' column
S1 = pd.concat([Donors,Activists,Low_Level_Activists,Ambassadors,Volunteers,Advocates], ignore_index= True)
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1 = S1.drop(['Email','Email Address'], axis = 1)
print(S1['Handle'].count())  # checks the full count

Now here is the difficult part: an email address can be a Donor and an Activist, an Activist and a Volunteer, even all of the Values at once, but mostly it's one email, one Value. I need to get a DataFrame (or list, or dictionary) that captures each unique email address under its earliest import class. If an email is a Donor and a Volunteer, then it's a Donor; if it's an Activist and a Low_Level_Activist, then an Activist; and so on.

I am looking for output that has each Value group along with the unique count of emails. This code:

pd.Series({value: len(group.Handle.unique()) for value, group in S1.groupby('Value')})

does the trick (h/t Jan W.), but it does not condense the email counts based on the earliest "Value". Help would be greatly appreciated, as I have been at this for many days. There are 1,700 instances in this set where an email exists in two or more places among 167,000 email addresses. I can't seem to get around using a for loop and then switching back to a pandas Series, and so on. Efficient code to figure this out would, again, be greatly appreciated!
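
To make the intended behaviour concrete, here is a tiny hand-built illustration (the addresses are made up, not from my real data):

import pandas as pd

# Hypothetical input: one address appears under two Values
S1 = pd.DataFrame(
    {
        "Handle": ["a@example.com", "a@example.com", "b@example.com"],
        "Value": ["Donors", "Activists", "Volunteers"],
    }
)
# Desired result: a@example.com should count only under Donors (its earliest
# import class), so the per-Value unique counts would be Donors: 1, Volunteers: 1.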

Sample data (again h/t Jan W):

fa = pd.DataFrame([['[email protected]', 'Donors'], ['[email protected]', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['[email protected]', 'Activists'], ['[email protected]', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])

Solution

  • Combining the pandas groupby and sort_values methods lets you run everything quite fast. The snippet below is an example implementation.

    First, generate ~100K rows of random data (the generation itself is a bit convoluted, but that is not relevant to your actual problem):

    from itertools import combinations
    import string
    
    import numpy as np
    import pandas as pd
    
    
    # ~30K distinct fake addresses: every 4-letter combination at two domains
    emails = [
        f"{''.join(chars)}@{domain}"
        for chars in combinations(string.ascii_lowercase, 4)
        for domain in ["example.com", "domain.org"]
    ]
    # "Ambassadors" is repeated so the roles are not uniformly distributed
    roles = [
        "Donors",
        "Activists",
        "Low_Level_Activists",
        "Ambassadors",
        "Ambassadors",
        "Ambassadors",
    ]
    
    n = 100_000
    
    df = pd.DataFrame(
        {
            "Handle": np.random.choice(emails, size=n),
            "Value": np.random.choice(roles, size=n),
        }
    ).drop_duplicates()
    

    Now df looks like:

                     Handle                Value
    0       [email protected]          Ambassadors
    1       [email protected]  Low_Level_Activists
    2       [email protected]  Low_Level_Activists
    3       [email protected]          Ambassadors
    4      [email protected]  Low_Level_Activists
    ...                 ...                  ...
    99993   [email protected]            Activists
    99994   [email protected]          Ambassadors
    99995  [email protected]          Ambassadors
    99997  [email protected]          Ambassadors
    99999   [email protected]               Donors
    
    [62609 rows x 2 columns]
    

    Then fetch, for each email, the first role it belongs to ("first" here means first after sorting Value alphabetically):

    df = df.sort_values("Value").groupby("Handle", as_index=False)["Value"].first()
    
                     Handle        Value
    0       [email protected]    Activists
    1      [email protected]       Donors
    2       [email protected]  Ambassadors
    3      [email protected]    Activists
    4       [email protected]  Ambassadors
    ...                 ...          ...
    28848  [email protected]    Activists
    28849   [email protected]  Ambassadors
    28850  [email protected]  Ambassadors
    28851   [email protected]  Ambassadors
    28852  [email protected]    Activists
    [28853 rows x 2 columns]
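
    Note that sort_values("Value") orders the roles alphabetically, which does not necessarily match the priority described in the question (Donors first, then Activists, and so on). If you need that import-order priority, one option is to make Value an ordered categorical before sorting; a minimal sketch, assuming the role names from the question:

    # Assumed priority, taken from the import order in the question
    priority = [
        "Donors",
        "Activists",
        "Low_Level_Activists",
        "Ambassadors",
        "Volunteers",
        "Followers",
    ]
    df["Value"] = pd.Categorical(df["Value"], categories=priority, ordered=True)

    # sort_values now sorts by that priority, so .first() keeps the
    # highest-priority role for each email
    df = df.sort_values("Value").groupby("Handle", as_index=False)["Value"].first()

    (Depending on your pandas version's default for observed, the final groupby below may then also list roles that end up with zero emails.)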
    

    And finally count the number of emails per role:

    df = df.groupby("Value")["Handle"].nunique()
    
    Value
    Activists              12765
    Ambassadors            13987
    Donors                  1374
    Low_Level_Activists      727
    Name: Handle, dtype: int64
    

    Once df is defined, the two calculations above run in only ~100 ms on my laptop.
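
    The two steps can also be written as a single chain; a sketch, assuming the same df as above:

    counts = (
        df.sort_values("Value")
          .drop_duplicates(subset="Handle", keep="first")  # keep one row per email
          .groupby("Value")["Handle"]
          .nunique()
    )

    This produces the same counts as the step-by-step version, since dropping duplicate Handles after the sort keeps exactly the first (earliest-sorted) role for each email.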