I want to do a 'modified' df.groupby.median() of a dataset by date and time combination, using the 'count' column.
Below is a sample of the dataset I'm working with:
            date   time  count
0       20160730  02:30    415
1       20160730  02:30     18
2       20160730  02:30     24
3       20160730  02:30     31
4       20160730  13:30     64
...          ...    ...    ...
169549  20170101  23:45     29
169550  20170101  23:45     34
169551  20170101  23:45     43
169552  20170101  23:45     42
169553  20170101  23:45     60
The challenge is that I want to calculate a median that also accounts for non-entries.
In the dataset, each 'date' and 'time' combination has up to 6 rows, because the data was collected from 6 separate locations. However, where the 'count' for a particular date/time/location combination is 0, the row is simply not entered into the dataset.
(I've dropped the 'location' column in the sample data above, but it is available if needed.)
This means that if I use the usual df.groupby.median() function, I would overestimate the true median of the data, as it would ignore the zeroes that were not entered.
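To illustrate the overestimation with made-up numbers (these values are hypothetical, not taken from the dataset): suppose only three of the six locations reported a count for some date/time slot.

```python
import numpy as np

reported = [10, 20, 30]   # hypothetical counts from the three locations that have rows
n_locations = 6           # six locations, as described above
missing = n_locations - len(reported)

# Median over the rows that exist, which is what a plain groupby median would use
naive_median = float(np.median(reported))

# Median once the missing locations are counted as zeros
true_median = float(np.median([0] * missing + reported))

print(naive_median, true_median)  # 20.0 5.0
```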
In essence, I want to calculate a modified median that goes like this:
For each date and time combination:
    count_rows = count number of rows that satisfy the date and time combination
    if count_rows == 6:
        mod_median = median of the 6 rows
    elif count_rows == 5 or count_rows == 4:
        mod_median = average of the 3rd and 4th highest row values
    elif count_rows == 3:
        mod_median = half of the lowest row value
        # true median == median of [0, 0, 0, value1, value2, value3]
    else:
        mod_median = 0
        # true median == zero for count_rows <= 2
How can I achieve this? Is there a more efficient logic to solve this than what I wrote above?
Thanks in advance
You can use a custom function, sorting with DataFrame.sort_values before calling GroupBy.apply:
import numpy as np

def cust_med(x):
    # x holds the rows of one date/time group, sorted by 'count' descending
    len1 = len(x.index)
    if len1 == 6:
        return x['count'].median()
    if (len1 == 5) or (len1 == 4):
        # mean of the 3rd and 4th highest values
        return x.iloc[[2, 3], x.columns.get_loc('count')].mean()
    if len1 == 3:
        # pad with 3 zeros for the missing locations, then take the median
        return np.median(np.concatenate([[0, 0, 0], x['count']]))
    else:
        return 0

df = (df.sort_values(['date', 'time', 'count'], ascending=[True, True, False])
        .groupby(['date', 'time']).apply(cust_med)
        .reset_index(name='custom median'))
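As for simpler logic: every branch above is equivalent to padding the group with zeros up to 6 values and taking an ordinary median. Below is a minimal sketch of that idea, assuming each date/time group has at most 6 rows. The names `padded_median`, `N_LOCATIONS`, and `sample` are made up for illustration, and the sample data is invented rather than taken from the question.

```python
import numpy as np
import pandas as pd

N_LOCATIONS = 6  # assumption from the question: six collection locations


def padded_median(counts, n_locations=N_LOCATIONS):
    """Median of one group's counts after padding missing locations with zeros."""
    values = counts.to_numpy()
    # Number of locations without a row; never negative, even for unexpected extra rows
    n_missing = max(n_locations - len(values), 0)
    return float(np.median(np.concatenate([np.zeros(n_missing), values])))


# Invented sample data, not the asker's dataset
sample = pd.DataFrame({
    "date": [20160730] * 4 + [20160731] * 3 + [20160801] * 2,
    "time": ["02:30"] * 4 + ["13:30"] * 3 + ["23:45"] * 2,
    "count": [415, 18, 24, 31, 10, 20, 30, 7, 9],
})

result = (sample.groupby(["date", "time"])["count"]
                .agg(padded_median)
                .reset_index(name="custom median"))
print(result)
```

With this approach no pre-sorting is needed, because np.median orders the values itself. For the invented data the three groups give 21.0, 5.0 and 0.0, which matches the 4-row, 3-row and 2-row rules from the question.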