
Setting non-constant value on a subset of rows and columns in a dataframe


Let's assume I have a dataframe D:

D = pd.DataFrame({'a': [1,2,3], 'b': [3,3,4], 'c': [2, 4, 0], 'd': [-1,-1,-1]})
D

    a   b   c   d
0   1   3   2   -1
1   2   3   4   -1
2   3   4   0   -1

I want to sort the values in columns a, b, c in, say, the first two rows, so the expected result would be:

    a   b   c   d
0   1   2   3   -1
1   2   3   4   -1
2   3   4   0   -1

To avoid the chained-assignment warning, I tried to do it with df.loc:

D.loc[[0,1], ['a', 'b', 'c']].agg(['min', 'median', 'max'], axis=1)

    min median  max
0   1.0 2.0 3.0
1   2.0 3.0 4.0

D.loc[[0,1], ['a', 'b', 'c']] = D.loc[[0,1], ['a', 'b', 'c']].agg(['min', 'median', 'max'], axis=1)
D

    a   b   c   d
0   NaN NaN NaN -1
1   NaN NaN NaN -1
2   3.0 4.0 0.0 -1

For some reason the values aren't set. I thought it could be because of the different column names, but neither

D.loc[[0,1], ['a', 'b', 'c']] = D.loc[[0,1], ['a', 'b', 'c']].agg(['min', 'median', 'max'], axis=1).rename(columns={"min": "a", "median": "b", "max": "c"})

nor

D.loc[[0,1], ['a', 'b', 'c']] = D.loc[[0,1], ['a', 'b', 'c']].agg(['min', 'median', 'max'], axis=1).to_numpy()

worked either. Changing the values manually isn't an option, because the real data has about 800k rows instead of 3 and I want to keep the operation vectorised. So what is the correct way of doing it?


Solution

  • Use numpy.sort:

    D.loc[[0,1], ['a', 'b', 'c']] = np.sort(D.loc[[0,1], ['a', 'b', 'c']], axis=1)
    print (D)
       a  b  c  d
    0  1  2  3 -1
    1  2  3  4 -1
    2  3  4  0 -1
    

    If you need to assign the result of an aggregate function, convert the output to a NumPy array with DataFrame.to_numpy:

    D.loc[[0,1], ['a', 'b', 'c']] = D.loc[[0,1], ['a', 'b', 'c']].agg(['min', 'median', 'max'], axis=1).to_numpy()
    print (D)
       a  b  c  d
    0  1  2  3 -1
    1  2  3  4 -1
    2  3  4  0 -1
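A likely reason the first attempt produced NaN is label alignment: when the right-hand side of a .loc assignment is a DataFrame, pandas matches its row and column labels against the target, and 'min', 'median', 'max' match none of 'a', 'b', 'c'. The sketch below is a minimal, self-contained illustration of that idea and of the numpy.sort fix. The names rows, cols, agg, aligned and fixed are introduced here for illustration only, and reindex is used as a rough stand-in for the alignment step rather than as a description of pandas internals.

```python
import numpy as np
import pandas as pd

D = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 3, 4], 'c': [2, 4, 0], 'd': [-1, -1, -1]})
rows, cols = [0, 1], ['a', 'b', 'c']  # illustrative names, not from the question

# The aggregated frame is labelled by function name, not by the original columns.
agg = D.loc[rows, cols].agg(['min', 'median', 'max'], axis=1)
print(list(agg.columns))  # ['min', 'median', 'max']

# Aligning those labels to 'a', 'b', 'c' finds no matches, so every cell is NaN.
# This approximates what happens when a DataFrame is assigned through .loc.
aligned = agg.reindex(columns=cols)
print(aligned.isna().all().all())  # True

# A plain NumPy array carries no labels, so values are placed by position.
fixed = D.copy()
fixed.loc[rows, cols] = np.sort(fixed.loc[rows, cols].to_numpy(), axis=1)
print(fixed)
```

Working on a copy (fixed) keeps D unchanged, so the snippet can be rerun from a clean state; an earlier failed assignment that already wrote NaN into D would otherwise carry over into later attempts.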