I've searched and searched and haven't found a solution or just an answer to why this is happening.
I'm trying to determine the mean()
per Borough for Request_Closing_Time
in my pandas dataframe. When I groupby
Borough and then get the mean for Request Closing Time I'm getting very large values, 13
to 44
days and then even 87,426
days for one result. The max value is only 24
days and the average for the whole dataframe is only 4
hours. If I single out an individual Borough with the get_group command then I seem to get a more normal answer which I believe is correct.
Any ideas as to why values for mean are so far off when groupby
is by Borough only?
In [283]:nyc_sr_calls_trimmed.describe()
Out[283]: Unique Key Request_Closing_Time Incident Zip
count 3.006980e+05 298534 298083.000000
mean 3.130054e+07 0 days 04:18:51.832782865 10848.888645
std 5.738547e+05 0 days 06:05:22.141833856 583.182081
min 3.027948e+07 0 days 00:01:00 83.000000
25% 3.080118e+07 0 days 01:16:33 10310.000000
50% 3.130436e+07 0 days 02:42:55.500000 11208.000000
75% 3.178446e+07 0 days 05:21:00 11238.000000
max 3.231065e+07 24 days 16:52:22 11697.000000
In [284]: nyc_sr_calls_trimmed['Request_Closing_Time'].mean()
Out[284]:Timedelta('0 days 04:18:51.832782865')
In [285]:by_burrough = nyc_sr_calls_trimmed.groupby(['Borough'])
by_burrough['Request_Closing_Time'].mean(numeric_only=None).dt.floor('s')
Out[285]:Borough
BRONX -13 days +03:21:47
BROOKLYN -13 days +03:18:24
MANHATTAN -36 days +14:38:51
QUEENS -16 days +07:37:05
STATEN ISLAND -44 days +22:01:24
Unspecified -87426 days +14:15:03
Name: Request_Closing_Time, dtype: timedelta64[ns]
In [286]:b = by_burrough.get_group('QUEENS')
b['Request_Closing_Time'].mean(numeric_only=False)
Out[286]:Timedelta('0 days 04:52:16.450111002')
Any help would be appreciated. I'm not sure what i'm not thinking of or accounting for. Thanks.
"If I single out an individual Borough with the get_group command then I seem to get a more normal answer which I believe is correct".
Since you mentioned the above, something I would try is to get the mean of Request_Closing_Time
for each Borough individually, as follows:
by_burrough["Request_Closing_Time"].apply(lambda df_group: df_group.mean(numeric_only=False))
or
by_burrough.apply(lambda df_group: df_group["Request_Closing_Time"].mean(numeric_only=False))