I'd like to replace every value above a column's median with the median of that column.
Here is my DataFrame:
import pandas as pd

m = pd.DataFrame({
    'a': range(5),
    'b': range(5, 10),
    'c': range(10, 15)})
print(m)
   a  b   c
0  0  5  10
1  1  6  11
2  2  7  12
3  3  8  13
4  4  9  14
Here is my solution:
for col in m.columns:
    quart = m[col].median()  # median of the current column
    # keep values below the median, replace the rest with the median
    m[col] = [val if val < quart else quart for val in m[col]]
print(m)
   a  b   c
0  0  5  10
1  1  6  11
2  2  7  12
3  2  7  12
4  2  7  12
I am not familiar with DataFrames, so I was wondering whether this can be done in a more 'pandas' way or with some fancy linear algebra.
Thank you in advance for the reply.
Edit:
Here is a quick timeit for the solutions from hurrial and chrisb respectively:
%timeit m.apply(lambda col: np.where(col.median() < col, col.median(), col))
1000 loops, best of 3: 1.36 ms per loop
%timeit np.minimum(m, m.median())
1000 loops, best of 3: 400 µs per loop
The solution using np.minimum seems to be faster.
Thank you, I've learnt two powerful things today: np.where and np.minimum!
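For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. The helper names with_where and with_minimum are made up for illustration, and with_minimum works on plain NumPy arrays rather than on the DataFrame directly, so it is not exactly the statement timed above. Absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

m = pd.DataFrame({'a': range(5), 'b': range(5, 10), 'c': range(10, 15)})


def with_where(df):
    # Hypothetical helper: per-column np.where, as in the first timed statement
    return df.apply(lambda col: np.where(col.median() < col, col.median(), col))


def with_minimum(df):
    # Hypothetical helper: element-wise minimum against the column medians.
    # Assumption: plain arrays are used here instead of the DataFrame itself.
    return np.minimum(df.to_numpy(), df.median().to_numpy())


# Time 200 calls of each variant
t_where = timeit.timeit(lambda: with_where(m), number=200)
t_min = timeit.timeit(lambda: with_minimum(m), number=200)
print(f"np.where via apply: {t_where:.4f} s, np.minimum: {t_min:.4f} s")
```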
There are a handful of different ways to do this. In general, a list comprehension is not an efficient way to express a pandas operation. That particular line could be rewritten as follows (see the indexing docs):
m.loc[m[col] >= quart, col] = quart
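For context, a minimal sketch of the whole loop with a boolean-mask assignment in place of the list comprehension, comparing each column against its median quart. The cast to float is an assumption added here, since a median is not an integer in general. The data is the example frame from the question.

```python
import pandas as pd

m = pd.DataFrame({
    'a': range(5),
    'b': range(5, 10),
    'c': range(10, 15)})

# Assumption: work in float so a fractional median can be stored without a dtype change
m = m.astype(float)

for col in m.columns:
    quart = m[col].median()  # column median (variable name kept from the question)
    # The boolean mask selects rows at or above the median; .loc assigns to them in place
    m.loc[m[col] >= quart, col] = quart

print(m)
```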
But the whole operation could be written in one line, like this (with numpy imported as np):
In [211]: m = np.minimum(m, m.median())
In [212]: m
Out[212]:
   a  b   c
0  0  5  10
1  1  6  11
2  2  7  12
3  2  7  12
4  2  7  12
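The one-liner works because the per-column medians are broadcast against every row of the frame. Below is a hedged sketch of the same idea in two forms: np.minimum on plain arrays, and DataFrame.clip with a per-column upper bound. Depending on the pandas version, a ufunc applied to a DataFrame and a Series together may not be supported, which is why the arrays are extracted explicitly here. The names med, capped and clipped are illustrative only.

```python
import numpy as np
import pandas as pd

m = pd.DataFrame({'a': range(5), 'b': range(5, 10), 'c': range(10, 15)})

med = m.median()  # Series of column medians, indexed by column name

# Array form: the (3,) median array broadcasts across the rows of the (5, 3) data
capped = pd.DataFrame(
    np.minimum(m.to_numpy(), med.to_numpy()),
    index=m.index,
    columns=m.columns,
)

# pandas form: clip each column at its own upper bound (axis=1 aligns med with the columns)
clipped = m.clip(upper=med, axis=1)

print(capped)
print(clipped)
```

Both variants return a new frame and leave m unchanged, so the result has to be assigned back if the original should be replaced.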