
Calculate the grouped median when the data is given as numeric intervals


This is my dataframe with interval numbers (classes) in it.

    import pandas as pd

    df = pd.DataFrame({'Class': [1,2,3,4,5,6,7,8,9,10,11],
                       'Class Interval': ['16.25-18.75', '18.75-21.25', '21.25-23.75',
                                          '23.75-26.25', '26.25-28.75', '28.75-31.25',
                                          '31.25-33.75', '33.75-36.25', '36.25-38.75',
                                          '38.75-41.25', '41.25-43.75'],
                       '𝑓𝑖' : [2,7,7,14,17,24,11,11,3,3,1],
                       'Cumulative 𝑓𝑖': [2,9,16,30,47,71,82,93,96,99,100],
                       '𝑓𝑖/n' : [.02,.07,.07,.14,.17,.24,.11,.11,.03,.03,.01],
                       'Cumulative 𝑓𝑖/n' : [.02, .09,.16,.30,.47,.71,.82,.93,.96,.99,1.00]})
    df

        Class   Class Interval   𝑓𝑖   Cumulative  𝑓𝑖     𝑓𝑖/𝑛    Cumulative  𝑓𝑖/𝑛 
    0   1       16.25-18.75      2          2            0.02   0.02
    1   2       18.75-21.25      7          9            0.07   0.09
    2   3       21.25-23.75      7         16            0.07   0.16
    3   4       23.75-26.25     14         30            0.14   0.30
    4   5       26.25-28.75     17         47            0.17   0.47
    5   6       28.75-31.25     24         71            0.24   0.71
    6   7       31.25-33.75     11         82            0.11   0.82
    7   8       33.75-36.25     11         93            0.11   0.93
    8   9       36.25-38.75     3          96            0.03   0.96
    9   10      38.75-41.25     3          99            0.03   0.99
    10  11      41.25-43.75     1         100            0.01   1.00

Question: How can I calculate the grouped median for this dataframe in Python?

Done by hand, the result is 29.06.

I've tried 'median_grouped':

    # importing median_grouped from the statistics module 
    from statistics import median_grouped

    # printing median_grouped for the set 
    print("Grouped Median is %s" %(median_grouped(df['Class Interval']))) 

But I got the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-491000133032> in <module>
      4 
      5 # printing median_grouped for the set
----> 6 print("Grouped Median is %s" %(median_grouped(df['Class Interval'])))

~\Anaconda3\ANACONDA\lib\statistics.py in median_grouped(data, interval)
    463     for obj in (x, interval):
    464         if isinstance(obj, (str, bytes)):
--> 465             raise TypeError('expected number but got %r' % obj)
    466     try:
    467         L = x - interval/2  # The lower limit of the median interval.

TypeError: expected number but got '28.75-31.25'

Then I tried splitting the interval into two columns (one with the lower limit and one with the upper limit), but that only gave me the median of the lower limits (28.75) or of the upper limits (31.25). I also tried using only the lower limit, but of course that gives 28.75 as well.

I do not have the values within the intervals, so I cannot rebuild a list of values to bin with pd.cut and compute it properly that way (I do not want to guess). I have also tried to turn the class intervals into bins manually (for example, 16.25-18.75 becomes (16.25, 18.75]), but then I got the error message: TypeError: unorderable types: Interval() < float()

Is it possible to make the interval column numeric instead of a string, so that the grouped median can be calculated automatically with Python?


Solution

  • You can recreate a list of artificial data points holding the same statistical information (the middle value of each interval, repeated fi times), and run the median_grouped function on them:

    from statistics import median_grouped

    # Lower, upper and middle value of each interval
    bounds = df['Class Interval'].str.split('-', expand=True).astype(float)
    df['lower'] = bounds[0]
    df['upper'] = bounds[1]
    df['middle'] = (df['lower'] + df['upper']) / 2

    # Artificial list of values with the same statistical info:
    # each class midpoint repeated 𝑓𝑖 times
    flat_list = []
    for middle, freq in zip(df['middle'], df['𝑓𝑖']):
        flat_list.extend([middle] * int(freq))

    # Calculating the grouped median with statistics.median_grouped
    median_grouped(flat_list, interval=2.5)   # Attention to the interval size!
    # => 29.0625
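
As a cross-check, the same number can be obtained without building the artificial list, by applying the usual interpolation formula for binned data directly to the frequency table: median = L + (n/2 - cf) / f * w, where L is the lower limit of the median class, cf the cumulative frequency below it, f its frequency and w its width. The sketch below is only an illustration under some assumptions: `grouped_median` is a hypothetical helper (not part of `statistics` or pandas), the classes are sorted in ascending order, and the limits are non-negative so that splitting the label on '-' is safe.

```python
def grouped_median(intervals, freqs, sep="-"):
    """Interpolated median of binned data (hypothetical helper, not a library function).

    intervals: labels such as '28.75-31.25', sorted in ascending order
    freqs:     frequency of each class, in the same order
    """
    n = sum(freqs)
    cum_before = 0  # cumulative frequency of all classes below the current one
    for label, f in zip(intervals, freqs):
        lower, upper = (float(part) for part in label.split(sep))
        # The median class is the first one whose cumulative frequency reaches n/2.
        if f > 0 and cum_before + f >= n / 2:
            return lower + (n / 2 - cum_before) / f * (upper - lower)
        cum_before += f
    raise ValueError("no observations")


# Same data as in the question; with the dataframe at hand these could be
# list(df['Class Interval']) and list(df['𝑓𝑖']).
intervals = ['16.25-18.75', '18.75-21.25', '21.25-23.75',
             '23.75-26.25', '26.25-28.75', '28.75-31.25',
             '31.25-33.75', '33.75-36.25', '36.25-38.75',
             '38.75-41.25', '41.25-43.75']
freqs = [2, 7, 7, 14, 17, 24, 11, 11, 3, 3, 1]

result = grouped_median(intervals, freqs)
print(result)  # 28.75 + (50 - 47) / 24 * 2.5 = 29.0625
```

Unlike the midpoint approach, this version takes the width from each class, so it should also cope with classes of unequal width.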