python · performance · tuples · pandas-groupby · unique

How to get unique emails in a two column list where values can have two (or more!) separate values


This is a repost of a similar question I asked a few weeks back. I think I am able to do this, but at a significant (read: untenable) cost to performance. A heroic poster there helped me somewhat, and I am reposting at their direction. My code:

import pandas as pd

# Load each source file, keeping only its email column
Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                        usecols=["Email Address"], na_filter= True)
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"], na_filter= True)
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                        usecols=["Email"], na_filter= True)
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Student Ambassadors Q1 2021.csv",
                        usecols=["Email Address"], na_filter= True)
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                        usecols=["Email Address"], na_filter= True)
# Advocates is loaded from its own CSV in the same way (call not shown)

Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Advocates['Value'] = 'Followers'

# Stack all sources into one frame, then coalesce the two differently
# named email columns into a single 'Handle' column
S1 = pd.concat([Donors,Activists,Low_Level_Activists,Ambassadors,Volunteers,Advocates], ignore_index= True)
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1 = S1.drop(['Email','Email Address'], axis = 1)
print(S1['Handle'].count())  # checks the full count

Now here is the difficult part: an email address can be a Donor and an Activist, an Activist and a Volunteer, even all of the Values at once, but mostly it's one email, one Value. I need to get a DataFrame (or list, or dictionary) that captures each unique email address under its earliest import class. If an email is a Donor and a Volunteer, then it's a Donor; if it's an Activist and a Low_Level_Activist, then an Activist; and so on.

I am looking for output that has each Value group along with the unique count of emails. This code:

pd.Series({value: len(group.Handle.unique()) for value, group in S1.groupby('Value')})

does the trick (h/t Jan W.), but it does not condense the email counts based on the earliest "Value". Help would be greatly appreciated, as I have been at this for many days. There are 1,700 instances in this set where an email exists in two or more places among 167,000 email addresses. I can't seem to get around using a for loop and then switching back to a pandas Series, and so on. Efficient code to figure this out would, again, be greatly appreciated!
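
To make the intended behaviour concrete, here is a tiny hand-built illustration (the addresses are made up, not from my real data):

import pandas as pd

# Hypothetical input: one address appears under two Values
S1 = pd.DataFrame(
    {
        "Handle": ["a@example.com", "a@example.com", "b@example.com"],
        "Value": ["Donors", "Activists", "Volunteers"],
    }
)
# Desired result: a@example.com should count only under Donors (its earliest
# import class), so the per-Value unique counts would be Donors: 1, Volunteers: 1.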

Sample data (again h/t Jan W):

fa = pd.DataFrame([['[email protected]', 'Donors'], ['[email protected]', 'Donors']], columns=['Handle', 'Value'])
fb = pd.DataFrame([['[email protected]', 'Activists'], ['[email protected]', 'Activists']], columns=['Handle', 'Value'])
S1 = pd.concat([fa, fb])

Solution

  • Combining the pandas groupby and sort_values methods lets you run everything quite fast. The snippet below is an example implementation.

    First, generate ~100K rows of random data (the generation itself is a bit convoluted, but that is not relevant to your actual problem):

    from itertools import combinations
    import string
    
    import numpy as np
    import pandas as pd
    
    
    # ~30K distinct fake addresses: every 4-letter combination at two domains
    emails = [
        f"{''.join(chars)}@{domain}"
        for chars in combinations(string.ascii_lowercase, 4)
        for domain in ["example.com", "domain.org"]
    ]
    # "Ambassadors" is repeated so the roles are not uniformly distributed
    roles = [
        "Donors",
        "Activists",
        "Low_Level_Activists",
        "Ambassadors",
        "Ambassadors",
        "Ambassadors",
    ]
    
    n = 100_000
    
    df = pd.DataFrame(
        {
            "Handle": np.random.choice(emails, size=n),
            "Value": np.random.choice(roles, size=n),
        }
    ).drop_duplicates()
    

    Now df looks like:

                     Handle                Value
    0       [email protected]          Ambassadors
    1       [email protected]  Low_Level_Activists
    2       [email protected]  Low_Level_Activists
    3       [email protected]          Ambassadors
    4      [email protected]  Low_Level_Activists
    ...                 ...                  ...
    99993   [email protected]            Activists
    99994   [email protected]          Ambassadors
    99995  [email protected]          Ambassadors
    99997  [email protected]          Ambassadors
    99999   [email protected]               Donors
    
    [62609 rows x 2 columns]
    

    Then fetch, for each email, the first role it belongs to ("first" here means first after sorting Value alphabetically):

    df = df.sort_values("Value").groupby("Handle", as_index=False)["Value"].first()
    
                     Handle        Value
    0       [email protected]    Activists
    1      [email protected]       Donors
    2       [email protected]  Ambassadors
    3      [email protected]    Activists
    4       [email protected]  Ambassadors
    ...                 ...          ...
    28848  [email protected]    Activists
    28849   [email protected]  Ambassadors
    28850  [email protected]  Ambassadors
    28851   [email protected]  Ambassadors
    28852  [email protected]    Activists
    [28853 rows x 2 columns]
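
    Note that sort_values("Value") orders the roles alphabetically, which does not necessarily match the priority described in the question (Donors first, then Activists, and so on). If you need that import-order priority, one option is to make Value an ordered categorical before sorting; a minimal sketch, assuming the role names from the question:

    # Assumed priority, taken from the import order in the question
    priority = [
        "Donors",
        "Activists",
        "Low_Level_Activists",
        "Ambassadors",
        "Volunteers",
        "Followers",
    ]
    df["Value"] = pd.Categorical(df["Value"], categories=priority, ordered=True)

    # sort_values now sorts by that priority, so .first() keeps the
    # highest-priority role for each email
    df = df.sort_values("Value").groupby("Handle", as_index=False)["Value"].first()

    (Depending on your pandas version's default for observed, the final groupby below may then also list roles that end up with zero emails.)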
    

    And finally count the number of emails per role:

    df = df.groupby("Value")["Handle"].nunique()
    
    Value
    Activists              12765
    Ambassadors            13987
    Donors                  1374
    Low_Level_Activists      727
    Name: Handle, dtype: int64
    

    Once df is defined, the two calculations above run in only ~100 ms on my laptop.
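
    The two steps can also be written as a single chain; a sketch, assuming the same df as above:

    counts = (
        df.sort_values("Value")
          .drop_duplicates(subset="Handle", keep="first")  # keep one row per email
          .groupby("Value")["Handle"]
          .nunique()
    )

    This produces the same counts as the step-by-step version, since dropping duplicate Handles after the sort keeps exactly the first (earliest-sorted) role for each email.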