I have a database that is built like so:
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1
and I want to generate a dummy variable for the n largest values per ID (here for n=2), so that it equals 1 for all values that are among the n highest values of their ID:
   id  value  Largest
0   1      1        0
1   1      2        1
2   1      3        1
3   2      1        0
4   2      2        0
5   2      3        1
6   2      4        1
7   3      1        1
8   4      1        1
I've tried:
df['highest'] = 0
df['highest'].loc[df['value'].isin(df.groupby(['id'])['value'].nlargest(1))] = 1
but that misattributes values from one ID if they happen to be among the highest values of another ID.
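To make the problem concrete, here is a small sketch (same data as above) of how the isin check over-flags: with n = 2 every single row ends up flagged, because each value 1 through 4 shows up in some ID's top two.

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4], 'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# the per-ID top-2 values are 3, 2 (id 1), 4, 3 (id 2), 1 (id 3), 1 (id 4)
top = df.groupby(['id'])['value'].nlargest(2)

# isin only compares the numbers, not the ID they belong to, so every row
# is flagged -- e.g. row 0 (id=1, value=1) is picked up just because
# 1 is the top value of IDs 3 and 4
df['value'].isin(top).astype(int).tolist()  # [1, 1, 1, 1, 1, 1, 1, 1, 1]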
Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
n = 2
nlargest and loc:
df['flag'] = 0
df.loc[df.groupby('id').value.nlargest(n).index.get_level_values(1), 'flag'] = 1
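Why this works: groupby(...).nlargest keeps the original row labels as the second level of its MultiIndex (the first level is the id), so get_level_values(1) yields exactly the row labels to flag. A quick sketch of the intermediate pieces with the setup above:

# Series indexed by (id, original row label), largest values first per id
df.groupby('id').value.nlargest(n)
# id
# 1   2    3
#     1    2
# 2   6    4
#     5    3
# 3   7    1
# 4   8    1
# Name: value, dtype: int64

# level 1 of that MultiIndex recovers the original row labels: [2, 1, 6, 5, 7, 8]
df.groupby('id').value.nlargest(n).index.get_level_values(1)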
np.where and assign:
This avoids modifying the DataFrame in place.
df.assign(
    flag=np.where(
        df.index.isin(df.groupby('id').value.nlargest(n).index.get_level_values(1)), 1, 0
    )
)
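A side note, not from the original answer: since assign returns a new DataFrame, bind the result if you want to keep the column.

out = df.assign(flag=np.where(df.index.isin(df.groupby('id').value.nlargest(n).index.get_level_values(1)), 1, 0))
# df still has no 'flag' column; out carries the result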
Both result in:
   id  value  flag
0   1      1     0
1   1      2     1
2   1      3     1
3   2      1     0
4   2      2     0
5   2      3     1
6   2      4     1
7   3      1     1
8   4      1     1
As @jezrael pointed out, np.where isn't really necessary here, since you are looking for a binary result; you can instead cast the boolean mask from isin directly to int:
df.assign(flag=df.index.isin(df.groupby('id').value.nlargest(n).index.get_level_values(1)).astype(int))