Search code examples
pythonpandasdataframedata-analysis

how to split and categorize value in a column of a pandas dataframe


I have a df,

    keys
0   one
1   two,one
2   " "
3   five,one
4   " "
5   two,four
6   four
7   four,five

and two lists,

 actual=["one","two"]
 syn=["four","five"]

I am creating a new row df["val"] I am trrying to get the categories of the cells in df["keys"]. If anyone of the key is present in actual then i want to add actual in a new column but same row, If anyone of the value is not present in actual then i want the corresponding df["val"] as syn. and it should not do anything on the white space cells.

My desired output is,

output_df

    keys      val
0   one       actual
1   two,one   actual
2   " "        
3   five,one  actual
4   " "
5   two,four  actual
6   four      syn
7   four,five syn

Please help, thanks in advance!


Solution

  • Use numpy.select with double conditions for check membership by compare sets:

    s = df['keys'].str.split(',')
    m1 = s.apply(set) & set(actual)
    m2 = s.apply(set) & set(syn)
    
    df['part'] = np.select([m1, m2], ['actual','syn'], default='')
    print (df)
            keys    part
    0        one  actual
    1    two,one  actual
    2                   
    3   five,one  actual
    4                   
    5   two,four  actual
    6       four     syn
    7  four,five     syn
    

    Timings:

    df = pd.concat([df] * 10000, ignore_index=True)
    
    
    In [143]: %%timeit 
         ...: s = df['keys'].str.split(',')
         ...: m1 = s.apply(set) & set(actual)
         ...: m2 = s.apply(set) & set(syn)
         ...: 
    1 loop, best of 3: 160 ms per loop
    
    #cᴏʟᴅsᴘᴇᴇᴅ' s solution
    In [144]: %%timeit
         ...: v = df['keys'].str.split(',',expand=True)
         ...: m1 = v.isin(["one","two"]).any(1)
         ...: m2 = v.isin(["four","five"]).any(1)
         ...: 
    1 loop, best of 3: 193 ms per loop
    

    Caveat:

    Performance really depends on the data.