Tags: python, pandas, normalization, median, quantile

Improve code to replace values above the median by the median value itself in a pandas DataFrame


I'd like to replace all the values above the median value of a column by the median value of the column itself.

Here is my DataFrame:

import pandas as pd

m = pd.DataFrame({
    'a': range(5),
    'b': range(5, 10),
    'c': range(10, 15)})

print(m)

    a   b   c
0   0   5   10
1   1   6   11
2   2   7   12
3   3   8   13
4   4   9   14

Here is my solution:

for col in m.columns:
    quart = m[col].median()
    m[col] = [val if val < quart else quart for val in m[col]]

print(m)

    a   b   c
0   0   5   10
1   1   6   11
2   2   7   12
3   2   7   12
4   2   7   12

I am not familiar with DataFrames, so I was wondering whether it is possible to do this in a more 'pandas' way, or by using some fancy linear algebra.

Thank you in advance for the reply.


Edit answer:

Here is a quick timeit for the solutions from hurrial and chrisb respectively:

%timeit m.apply(lambda col: np.where(col.median() < col, col.median(), col))
1000 loops, best of 3: 1.36 ms per loop

%timeit np.minimum(m, m.median())
1000 loops, best of 3: 400 µs per loop

The solution using np.minimum seems to be faster.
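For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The function names with_where and with_minimum are made up for illustration. Two details differ from the one-liners above and are assumptions on my part: the np.where result is wrapped in a Series so the row index is kept, and np.minimum is applied to the raw NumPy arrays rather than to the DataFrame and Series directly. Timings will vary by machine and library version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

m = pd.DataFrame({
    'a': range(5),
    'b': range(5, 10),
    'c': range(10, 15)})


def with_where(df):
    # Column by column: where the median is below the value, use the median.
    # Wrapping in a Series (an assumption, not in the original one-liner)
    # keeps the original row index.
    return df.apply(
        lambda col: pd.Series(
            np.where(col.median() < col, col.median(), col),
            index=col.index))


def with_minimum(df):
    # Element-wise minimum on the raw values; the 1-D array of column
    # medians broadcasts across the rows.
    capped = np.minimum(df.to_numpy(), df.median().to_numpy())
    return pd.DataFrame(capped, index=df.index, columns=df.columns)


if __name__ == "__main__":
    runs = 1000
    for func in (with_where, with_minimum):
        seconds = timeit.timeit(lambda: func(m), number=runs)
        print(f"{func.__name__}: {seconds / runs * 1e6:.0f} µs per call")
```

On a 5x3 frame most of the time is fixed overhead, so it is worth trying a larger frame before drawing conclusions.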

Thank you. I've learnt two powerful things today: np.where and np.minimum!


Solution

  • There are a handful of different ways to do this. In general, a list comprehension is not an efficient way to express a pandas operation. That particular line could be rewritten with boolean indexing (see the indexing docs):

    m.loc[m[col] >= quart, col] = quart
    

    But the whole operation could be written in one line, like this (importing numpy as np):

    In [211]: m = np.minimum(m, m.median())
    
    In [212]: m
    Out[212]: 
       a  b   c
    0  0  5  10
    1  1  6  11
    2  2  7  12
    3  2  7  12
    4  2  7  12
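The same capping can also be written with pandas methods only, which may be useful if np.minimum does not accept a DataFrame together with a Series in your pandas version. This is a minimal sketch, not part of the original answer: it assumes DataFrame.clip and DataFrame.where both take a Series plus axis=1 to line the medians up with the columns. The names medians, capped and masked are illustrative.

```python
import pandas as pd

m = pd.DataFrame({
    'a': range(5),
    'b': range(5, 10),
    'c': range(10, 15)})

# One median per column, indexed by column name.
medians = m.median()

# Cap each column at its own median; axis=1 matches medians to columns.
capped = m.clip(upper=medians, axis=1)

# The same result with boolean masking: keep values at or below the
# median and replace everything else with that column's median.
masked = m.where(m.le(medians, axis=1), medians, axis=1)

print(capped)
```

Because the medians are floats, the result columns may come back as floats rather than integers, depending on the pandas version.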