Search code examples
pythonpandasdummy-variable

generating dummy variables for n largest values in Pandas


I have a database that is built like so:

>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})

>>> df

    id  value    
0   1      1    
1   1      2    
2   1      3    
3   2      1    
4   2      2    
5   2      3    
6   2      4    
7   3      1    
8   4      1

and I want to generate a dummy variable for the nth largest values per ID (here for n=2) so that it's equal for 1 for all the values that are one of the nth highest value:

    id  value  Largest 
0   1      1     0    
1   1      2     1    
2   1      3     1    
3   2      1     0    
4   2      2     0    
5   2      3     1    
6   2      4     1    
7   3      1     1    
8   4      1     1

I've tried:

df['highest'] = 0
df['highest'].loc[df['value'].isin(df.groupby(['id'])['value'].nlargest(1))] = 1

but that would misatribute values from one ID if that happened to be the highest values in another ID


Solution

  • Setup

    df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
    n = 2
    

    Using nlargest and loc:

    df['flag'] = 0
    df.loc[df.groupby('id').value.nlargest(n).index.get_level_values(1), 'flag'] = 1
    

    Using np.where and assign:

    This avoids modifying the DataFrame in place.

    df.assign(
        flag=np.where(
            df.index.isin(df.groupby('id').value.nlargest(n).index.get_level_values(1)), 1, 0
        )
    )
    

    Both result in:

       id  value  flag
    0   1      1     0
    1   1      2     1
    2   1      3     1
    3   2      1     0
    4   2      2     0
    5   2      3     1
    6   2      4     1
    7   3      1     1
    8   4      1     1
    

    As @jezrael pointed out, np.where isn't really necessary here, since you are looking for binary results, and you can instead use:

    df.assign(flag=df.index.isin(df.groupby('id').value.nlargest(n).index.get_level_values(1)).astype(int))