
Python - keep rows in dataframe based on partial string match


I have two dataframes:
df1 is a list of mailboxes and email IDs
df2 is a list of approved domains

I read both dataframes from an Excel sheet:

    xls = pd.ExcelFile(input_file_shared_mailbox)
    df = pd.read_excel(xls, sheet_name = sheet_name_shared_mailbox)

I want to keep only the records in df1 where df1['Email_Id'] contains one of the values in df2['approved_domain'].

    print(df1)  
        Mailbox Email_Id  
    0   mailbox1   [email protected]  
    1   mailbox2   [email protected]  
    2   mailbox3   [email protected]  

    print(df2)  
        approved_domain  
    0   msn.com  
    1   gmail.com  

I want a df3 that looks like this:

    print (df3)  
        Mailbox Email_Id  
    0   mailbox1   [email protected]  
    1   mailbox3   [email protected]  

This is the code I have right now. I think it is close, but I can't figure out what exactly is wrong with it:

    df3 = df1[df1['Email_Id'].apply(lambda x: [item for item in x if item in df2['Approved_Domains'].tolist()])]

But I get this error:

    TypeError: unhashable type: 'list'

I spent a lot of time searching the forum for a solution but could not find what I was looking for. I appreciate all the help.


Solution

  • Follow these steps to keep only the rows of df1 whose email domain appears in df2

    1. Split the Email_Id column at the '@' into two separate columns

          # expand=True returns the two parts as separate columns
          df1[['add', 'domain']] = df1['Email_Id'].str.split('@', n=1, expand=True)
    

    2. Then drop the add column to keep your data frame clean

          df1 = df1.drop('add', axis=1)
    

    3. Get a new data frame that keeps only the rows whose 'domain' value appears in the 'approved_domain' column of df2

          df_new = df1[df1['domain'].isin(df2['approved_domain'])]
    

    4. Drop the 'domain' column in df_new

          df_new = df_new.drop('domain', axis=1)
    

    This is what the result will be

        Mailbox     Email_Id
    0   mailbox1    [email protected]
    2   mailbox3    [email protected]
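
The same filter can probably be written without adding and dropping helper columns. The following is only a minimal sketch: the sample addresses are made up for illustration (the real ones are not visible in the question), and it assumes the domain is whatever follows the last '@' and that matching is case-sensitive.

```python
import pandas as pd

# Hypothetical sample data; the email addresses are invented for this sketch.
df1 = pd.DataFrame({
    "Mailbox": ["mailbox1", "mailbox2", "mailbox3"],
    "Email_Id": ["alice@msn.com", "bob@example.org", "carol@gmail.com"],
})
df2 = pd.DataFrame({"approved_domain": ["msn.com", "gmail.com"]})

# Take the text after the last '@' as the domain, without modifying df1.
domains = df1["Email_Id"].str.rsplit("@", n=1).str[-1]

# Boolean mask: True where the domain is one of the approved domains.
df3 = df1[domains.isin(df2["approved_domain"])]

print(df3)
```

Comparing whole domains with isin, rather than testing substrings with str.contains, should avoid accidental matches such as an address at notgmail.com passing for gmail.com. If the sheet mixes upper and lower case, lowercasing both sides with .str.lower() before comparing is one option.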