Python function to calculate a median without mean in a dataframe

I have a large DataFrame (about 3 GB) and I want to compute a kind of median per group, grouping on a few columns. When a group has an even number of values, I don't want the mean of the two central elements; I want the lower of those two values. I know how to compute a normal median; here is an example that reproduces my issue:

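import pandas as pd
data = {'idx':   [1,1,1,1,1,2,2,2,2,2,2,3,3,3,4,4,5],
        'value': [5,12,7,8,10,3,8,4,6,1,19,5,10,12,3,8,14]}

df = pd.DataFrame(data, columns=['idx', 'value'])
# Standard median per group (averages the two central values for even-sized groups)
df['median'] = df.groupby('idx')['value'].transform('median')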

    idx  value  median
0     1    5.0     8.0
1     1   12.0     8.0
2     1    7.0     8.0
3     1    8.0     8.0
4     1   10.0     8.0
5     2    3.0     5.0
6     2    8.0     5.0
7     2    4.0     5.0
8     2    6.0     5.0
9     2    1.0     5.0
10    2   19.0     5.0
11    3    5.0    10.0
12    3   10.0    10.0
13    3   12.0    10.0
14    4    3.0     5.5
15    4    8.0     5.5
16    5   14.0    14.0

But as I said, this is not the result I want.

I want:

  • for idx=2 the values are 1, 3, 4, 6, 8, 19, so the median gives (4+6)/2 -> 5, but I want min(4,6) -> 4
  • for idx=4 the values are 3, 8, so the median gives (3+8)/2 -> 5.5, but I want min(3,8) -> 3
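
To make the rule above concrete, here is a small sketch in plain Python. The helper name low_median is made up for illustration. It sorts the values and picks the element at index (n - 1) // 2, which is the middle element when n is odd and the lower of the two central elements when n is even.

```python
def low_median(values):
    """Return the middle value, or the lower of the two middle values."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("low_median() needs at least one value")
    # (n - 1) // 2 is the middle index for odd n and the lower-middle index for even n
    return ordered[(len(ordered) - 1) // 2]


# The idx=2 and idx=4 groups from the example data
group_2 = low_median([3, 8, 4, 6, 1, 19])
group_4 = low_median([3, 8])
```

With the example data, group_2 is 4 and group_4 is 3, matching the two bullet points.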

I can do this with the function below, but it performs very poorly:

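import numpy as np

def calcul_median(x):
    # Sort the group's values so the central elements can be indexed
    a = sorted(x['value'].to_list())
    if len(a)%2==1:
        a = np.median(a)
    elif len(a)%2==0:
        # Even count: take the lower of the two central values
        a = a[int((len(a)/2)-1)]
    x['median'] = a
    return x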


This function works, but it is very slow (50 times slower than median).


The function statistics.median_low does this, but it is also slow: 3 s with numpy versus 52 s with statistics.
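
For reference, one way statistics.median_low could be wired into a groupby is sketched below. The question does not show the exact call that was timed, so the agg/map combination here is an assumption. It calls a Python function once per group, which is correct but likely explains the slowdown when there are many groups.

```python
import statistics

import pandas as pd

df = pd.DataFrame({
    'idx':   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5],
    'value': [5, 12, 7, 8, 10, 3, 8, 4, 6, 1, 19, 5, 10, 12, 3, 8, 14],
})

# One Python-level call per group: simple, but not vectorized
low = df.groupby('idx')['value'].agg(statistics.median_low)

# Broadcast each group's result back onto its rows
df['median'] = df['idx'].map(low)
```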

I have tried another function using argpartition:

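def calcul_tps_medianv2(x):
    a = x['value'].to_numpy()
    if len(a)%2==1:
        a = np.median(a)
    elif len(a)%2==0:
        # Even count: partition around the lower central position and take that value
        k = len(a)//2 - 1
        a = a[np.argpartition(a, k)[k]]
    x['median'] = a
    return x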

But that is even slower than the statistics solution.

Do you have any idea how to speed up this function, or any other approach? Thanks for your help.


  • The standard library contains a median_low() function that does just that.

    Tim Pietzcker
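
Since the question reports that statistics.median_low is too slow on a large frame, a possible vectorized alternative is the groupby quantile with interpolation='lower'. This is only a sketch on the example data, not a benchmarked solution. It relies on the 0.5 quantile falling between the two central values for even-sized groups, so that 'lower' selects the smaller one.

```python
import pandas as pd

df = pd.DataFrame({
    'idx':   [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5],
    'value': [5, 12, 7, 8, 10, 3, 8, 4, 6, 1, 19, 5, 10, 12, 3, 8, 14],
})

# interpolation='lower' returns an actual data point instead of averaging
# the two central values when the group size is even
low = df.groupby('idx')['value'].quantile(0.5, interpolation='lower')

# Broadcast the per-group result back onto the original rows
df['median'] = df['idx'].map(low)
```

On the example data this should give 8, 4, 10, 3 and 14 for idx 1 to 5. Whether it is fast enough on a 3 GB frame would need to be measured.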