python pandas dataframe lambda assign

Why is this lambda function with df.assign not working row-wise as expected?


I have a dataframe and want to get the idxmin of a values column for each ID (an ID can occur multiple times). My df:

import pandas as pd

data = pd.DataFrame({'ID': [123, 122, 122, 122, 123, 125, 126],
                     'values': [2, 1, 2, 8, 6, 4, 5]})

Now I would use a lambda function that filters the df down to all occurrences of the current ID and uses idxmin to get the index of the minimum value in that subselection. The individual parts work as intended on their own, but when I combine them, the result is the same index (the overall idxmin) for every row.

data.assign(TEST = lambda x: data.loc[data["ID"]==x["ID"],"values"].idxmin())

Output:

Index ID values TEST
0 123 2 1
1 122 1 1
2 122 2 1
3 122 8 1
4 123 6 1
5 125 4 1
6 126 5 1

Does anybody know why the behaviour is like that instead of:

Index ID values TEST
0 123 2 0
1 122 1 1
2 122 2 1
3 122 8 1
4 123 6 0
5 125 4 5
6 126 5 6

Thanks!


Solution

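  • In your assign, the lambda is called once with x being the full dataframe, not once per row. The condition therefore compares the ID column with itself and is True everywhere, thus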

    data.loc[data["ID"]==data["ID"],"values"].idxmin()
    

    returns 1 (the index of the overall minimum), and your code is equivalent to:

    data.assign(TEST=1)
    

    You need to use groupby here:

    data['TEST'] = data.groupby('ID')['values'].transform('idxmin')
    

    output:

        ID  values  TEST
    0  123       2     0
    1  122       1     1
    2  122       2     1
    3  122       8     1
    4  123       6     0
    5  125       4     5
    6  126       5     6
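
If you want to check this yourself, here is a minimal sketch using the same data. The helper name `probe` and the variables `seen`, `broken`, `row_wise` and `grouped` are made up for illustration. It records what the callable passed to assign actually receives, then compares a row-wise `apply(axis=1)` version of the original idea with the groupby approach.

```python
import pandas as pd

data = pd.DataFrame({'ID': [123, 122, 122, 122, 123, 125, 126],
                     'values': [2, 1, 2, 8, 6, 4, 5]})

seen = []  # type of every argument the callable receives

def probe(x):
    # Hypothetical helper with the same logic as the lambda in the question.
    seen.append(type(x).__name__)
    return data.loc[data["ID"] == x["ID"], "values"].idxmin()

# assign calls probe a single time and broadcasts the scalar it returns.
broken = data.assign(TEST=probe)
print(seen)
print(broken["TEST"].tolist())

# Row-wise version of the original idea: axis=1 passes one row at a time,
# so row["ID"] is a single value and the mask selects only that ID.
row_wise = data.apply(
    lambda row: data.loc[data["ID"] == row["ID"], "values"].idxmin(),
    axis=1,
)

# The groupby version from the answer above.
grouped = data.groupby("ID")["values"].transform("idxmin")
print(row_wise.tolist())
print(grouped.tolist())
```

The row-wise apply gives the same per-ID result as the groupby, but it filters the whole dataframe once per row, so the groupby version is likely the better choice on larger data.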