Search code examples
python-3.xfiltermergeuniquecartesian-product

Avoiding cartesian when adding unique classifier to a list in python 3


I have 5 .csv files I am importing and all contain emails:

Donors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Donors Q1 2021 R12.csv",
                        usecols=["Email Address"])
Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Activists Q1 2021 R12.csv",
                        usecols=["Email"])
Low_Level_Activists = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Low Level Activists Q1 2021 R12.csv",
                        usecols=["Email"]) 
Ambassadors = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Ambassadors Q1 2021.csv",
                        usecols=["Email Address"])
Volunteers = pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Volunteers Q1 2021 R12.csv",
                        usecols=["Email Address"])
 Followers= pd.read_csv(r"C:\Users\am\Desktop\email parsing\Q1 2021\Followers Q1 2021 R12.csv",
                        usecols=["Email"])

While I am only importing emails (annoyingly with two different naming conventions because of the systems they originate from), I am adding the import name as a classifer - i.e. Donors, Volunteers, etc.

Donors['Value'] = "Donors"
Activists['Value'] = "Activists"
Low_Level_Activists['Value'] = "Low_Level_Activists"
Ambassadors['Value'] = "Ambassadors"
Volunteers['Value'] = "Volunteers"
Advocates['Value'] = 'Followers'

I then concatenate all the files and handle the naming issue. I am sure there is a more elegant way to do this but here's what I have:

S1= pd.concat([Donors,Activists,Low_Level_Activists,Ambassadors,Volunteers,Advocates], ignore_index= True) 
S1['Handle'] = S1['Email Address'].where(S1['Email Address'].notnull(), S1['Email'])
S1= S1.drop(['Email','Email Address'], axis = 1)
print(S1['Handle'].count()) #checks full count

The total on that last line is 166,749

Here is my problem. I need to filter the emails for uniques - easy enough using .nuniques() and the but the problem I am having is I also need to carry the classifier. So if a singular email is a Donor but also an Activist, I pull both when I try to merge the unique values with the classifier.

I have been at this for many hours (and to the end of the Internet!) and can't seem to find a workable solution. I've tried dictionary for loops, merges, etc. ad infinitum. The unique email count is 165,923 (figured out via Python &/or excel:( ).

Essentially I would want to pull the earliest classifier in my list on a match. So if an email is a Donor and an Activist-> call them a Donor. Or if a email is a Volunteer and a Follower -> call them a Volunteer on one email record.

Any help would be greatly appreciated.


Solution

  • I'll give it a try with some made-up data:

    import pandas as pd
    
    fa = pd.DataFrame([['[email protected]', 'Donors'], ['[email protected]', 'Donors']], columns=['Handle', 'Value'])
    fb = pd.DataFrame([['[email protected]', 'Activists'], ['[email protected]', 'Activists']], columns=['Handle', 'Value'])
    S1 = pd.concat([fa, fb])
    print(S1)
    

    gives

               Handle      Value
    0   [email protected]     Donors
    1    [email protected]     Donors
    0   [email protected]  Activists
    1  [email protected]  Activists
    
    

    You can group by Handle and then pick any Value you like, e.g. the first:

    for handle, group in S1.groupby('Handle'):
        print(handle, group.reset_index().loc[0, 'Value'])
    

    gives

    [email protected] Activists
    [email protected] Donors
    [email protected] Donors
    

    or collect all roles of a person:

    for handle, group in S1.groupby('Handle'):
        print(handle, group.Value.unique())
    

    gives

    [email protected] ['Activists']
    [email protected] ['Donors']
    [email protected] ['Donors' 'Activists']