I have a df,
keys
0 one
1 two,one
2 " "
3 five,one
4 " "
5 two,four
6 four
7 four,five
and two lists,
actual=["one","two"]
syn=["four","five"]
I am creating a new row df["val"]
I am trrying to get the categories of the cells in df["keys"]
. If anyone of the key is present in actual
then i want to add actual in a new column but same row, If anyone of the value is not present in actual then i want the corresponding df["val"] as syn
. and it should not do anything on the white space cells.
My desired output is,
output_df
keys val
0 one actual
1 two,one actual
2 " "
3 five,one actual
4 " "
5 two,four actual
6 four syn
7 four,five syn
Please help, thanks in advance!
Use numpy.select
with double conditions for check membership by compare set
s:
s = df['keys'].str.split(',')
m1 = s.apply(set) & set(actual)
m2 = s.apply(set) & set(syn)
df['part'] = np.select([m1, m2], ['actual','syn'], default='')
print (df)
keys part
0 one actual
1 two,one actual
2
3 five,one actual
4
5 two,four actual
6 four syn
7 four,five syn
Timings:
df = pd.concat([df] * 10000, ignore_index=True)
In [143]: %%timeit
...: s = df['keys'].str.split(',')
...: m1 = s.apply(set) & set(actual)
...: m2 = s.apply(set) & set(syn)
...:
1 loop, best of 3: 160 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ' s solution
In [144]: %%timeit
...: v = df['keys'].str.split(',',expand=True)
...: m1 = v.isin(["one","two"]).any(1)
...: m2 = v.isin(["four","five"]).any(1)
...:
1 loop, best of 3: 193 ms per loop
Caveat:
Performance really depends on the data.