I want to create a column based on a group and threshold for cutoff from another column for each group of the grouped column.
The dataframe is below:
df_in ->
unique_id myvalue identif
0 CTA15 19.0 TOP
1 CTA15 22.0 TOP
2 CTA15 28.0 TOP
3 CTA15 18.0 TOP
4 CTA15 22.4 TOP
5 AC007 2.0 TOP
6 AC007 2.3 SDME
7 AC007 2.0 SDME
8 AC007 5.0 SDME
9 AC007 3.0 SDME
10 AC007 31.4 SDME
11 AC007 4.4 SDME
12 CGT6 9.7 BTME
13 CGT6 44.5 BTME
14 TVF5 6.7 BTME
15 TVF5 9.1 BTME
16 TVF5 10.0 BTME
17 BGD1 1.0 BTME
18 BGD1 1.6 NON
19 GHB 51.0 NON
20 GHB 54.0 NON
21 GHB 4.7 NON
So I have created a dictionary based on each group of the 'identif' column as :
md = {'TOP': 22, 'SDME': 10, 'BTME': 20, 'NON':20}
So my goal is to create a new column, say 'chk', based on the following condition:
If the "identif" column matches the key in the dictionary "md" and the value for that key is >= than the corresponding value in the "myvalue" column then I will have 1, otherwise 0.
However, I am trying to find a good way using map/groupby/apply to create the new output data frame. I am now doing a very inefficient way ( which is taking considerable time on real data of million rows) using a function as follows:
def myfilter(df, idCol, valCol, mydict):
for index,row in df.iterrows():
for key, value in mydict.items():
if row[idCol] == key and row[valCol] >= value:
df['chk'] = 1
elif row[idCol] == key and row[valCol] < value:
df['chk'] = 0
return df
Getting the output via the following call:
df_out = myfilter(df_in, 'identif', 'myvalue', md)
So my output will be like:
df_out ->
unique_id myvalue identif chk
0 CTA15 19.0 TOP 0
1 CTA15 22.0 TOP 1
2 CTA15 28.0 TOP 1
3 CTA15 18.0 TOP 0
4 CTA15 22.4 TOP 1
5 AC007 2.0 TOP 0
6 AC007 2.3 SDME 0
7 AC007 2.0 SDME 0
8 AC007 5.0 SDME 0
9 AC007 3.0 SDME 0
10 AC007 31.4 SDME 1
11 AC007 4.4 SDME 0
12 CGT6 9.7 BTME 0
13 CGT6 44.5 BTME 1
14 TVF5 6.7 BTME 0
15 TVF5 9.1 BTME 0
16 TVF5 10.0 BTME 0
17 BGD1 1.0 BTME 0
18 BGD1 1.6 NON 0
19 GHB 51.0 NON 1
20 GHB 54.0 NON 1
21 GHB 4.7 NON 0
This works but extremely inefficient and would like a much better way to do it.
This should be faster:
def func(identif, value):
if identif in md:
if value >= md[identif]:
return 1.0
else:
return 0.0
else:
return np.NaN
df['chk'] = df.apply(lambda row: func(row['identif'], row['myvalue']), axis=1)
The timing on this little example:
CPU times: user 1.64 ms, sys: 73 µs, total: 1.71 ms
Wall time: 1.66 ms
Your version timing:
CPU times: user 8.6 ms, sys: 1.92 ms, total: 10.5 ms
Wall time: 8.79 ms
Although on such a small example it's not conclusive.