Csv pandas groupby with a 'modified' median


I want to compute a 'modified' median of the 'count' column for each date and time combination in a dataset, as a variant of the usual df.groupby(...).median().

Below is a sample of the dataset I'm working with:

              date    time    count
0         20160730    02:30     415
1         20160730    02:30      18
2         20160730    02:30      24
3         20160730    02:30      31
4         20160730    13:30      64
...            ...      ...     ...
169549    20170101    23:45      29
169550    20170101    23:45      34
169551    20170101    23:45      43
169552    20170101    23:45      42
169553    20170101    23:45      60

The challenge is that I want to calculate a median that also accounts for non-entries.

In the dataset, each 'date' and 'time' combination has up to 6 rows, because the data was collected from 6 separate locations. However, where the 'count' for a particular date/time/location combination is 0, the row is simply not entered into the dataset.

(I've dropped the 'location' column in the sample data above, but it is available if needed.)

This means that if I use the usual df.groupby.median() function, I would overestimate the true median of the data, as it would ignore the zeroes that were not entered.
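To illustrate the gap, here is a minimal sketch with made-up counts (not taken from the dataset above), assuming only 3 of the 6 locations reported a non-zero value:

```python
import numpy as np
import pandas as pd

# Hypothetical counts: only 3 of the 6 locations have a row.
observed = pd.Series([18, 24, 31])

# Plain median ignores the 3 locations whose zero counts were never entered.
plain_median = observed.median()

# Padding with the missing zeros gives the median over all 6 locations.
padded = np.concatenate([observed.to_numpy(), np.zeros(6 - len(observed))])
true_median = np.median(padded)

print(plain_median, true_median)  # 24.0 9.0
```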

In essence, I want to calculate a modified median that goes like this:

For each date and time combination:
   count_rows = count number of rows that satisfy the date and time combination
   if count_rows == 6:
      mod_median = median of the 6 rows
   elif count_rows == 5 or count_rows == 4:
      mod_median = average of the 3rd and 4th highest row values
   elif count_rows == 3:
      mod_median = half of the lowest row value
      # true median == median of [0, 0, 0, value1, value2, value3]
   else:
      mod_median = 0
      # true median == zero for count_rows <= 2

How can I achieve this? Is there a more efficient logic to solve this than what I wrote above?

Thanks in advance


Solution

  • You can use a custom function with GroupBy.apply, after sorting each group's counts in descending order with DataFrame.sort_values:

    import numpy as np

    def cust_med(x):
        # x is one date/time group, already sorted by 'count' descending
        len1 = len(x.index)
        if len1 == 6:
            # all 6 locations present: ordinary median
            return x['count'].median()
        if (len1 == 5) or (len1 == 4):
            # mean of the 3rd and 4th highest values
            return x.iloc[[2, 3], x.columns.get_loc('count')].mean()
        if len1 == 3:
            # pad with the 3 missing zeros, then take the median
            return np.median(np.concatenate([[0, 0, 0], x['count']]))
        # 2 rows or fewer: at least 4 of the 6 values are zero
        return 0

    df = (df.sort_values(['date', 'time', 'count'], ascending=[True, True, False])
            .groupby(['date', 'time']).apply(cust_med)
            .reset_index(name='custom median'))
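
  • As for the question about simpler logic: every case in the rules above is the median of the group's counts padded with zeros up to six values, so one possible alternative is to pad and take the median directly, with no sorting or branching. This is only a sketch. The helper name zero_padded_median, the constant N_LOCATIONS and the sample frame are made up for illustration, and it assumes no group ever has more than six rows.

```python
import numpy as np
import pandas as pd

N_LOCATIONS = 6  # assumed from the question: six collection sites


def zero_padded_median(counts, n=N_LOCATIONS):
    """Median of `counts` after padding with zeros up to `n` values."""
    values = counts.to_numpy()
    if len(values) > n:
        raise ValueError(f"expected at most {n} rows per group, got {len(values)}")
    padded = np.zeros(n)
    padded[:len(values)] = values  # remaining slots stay 0 (the missing rows)
    return float(np.median(padded))


# Small made-up frame for illustration only: groups of 4, 3 and 2 rows.
sample = pd.DataFrame({
    'date': [20160730] * 4 + [20160731] * 3 + [20160801] * 2,
    'time': ['02:30'] * 4 + ['02:30'] * 3 + ['13:30'] * 2,
    'count': [415, 18, 24, 31, 10, 20, 30, 5, 7],
})

result = (sample.groupby(['date', 'time'])['count']
                .agg(zero_padded_median)
                .reset_index(name='custom median'))
print(result)
```

    On this sample the three groups give 21.0, 5.0 and 0.0, matching the 4-row, 3-row and 2-row rules from the question.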