I am working on a dataset of taxi rides in the city of Chicago. The data contains per-trip information such as Taxi ID, Timestamp, Fare, among others. A sample of the data from df.head() is shown below,
with the timestamps as pandas datetimes:
Taxi ID Pickup Community Area Trip Total
Trip End Timestamp
2016-04-25 18:00:00 bfec2c659fcbed3b508dc6caaf6f7ef39e17801e8ca0cc... 8.0 16.00
2016-11-16 15:45:00 f45c4bfa9d6c445eb03ab69093f7ec1e9cdecef83cb2ce... 32.0 9.75
2016-01-15 17:45:00 ecfb6f2cdce5d4c4e80218f58070ae719060ee47e648f4... 32.0 14.75
2016-01-20 21:30:00 7d8179131ea9952793af4cda8635e94b56c2b92d3c376c... 32.0 7.25
2016-02-11 09:15:00 d5c4fbae1c0c510364404a90fd477b19f7f03408ce40ff... 35.0 18.75
Assuming I have data for one week, I want to obtain the median hourly revenue over all taxis, i.e. for each hour, sum up each taxi's fares within that hour and then take the median across taxis.
This can be extended to hours in a week, weeks in a year, etc.
As of now, I am able to use pivot_table
to, say, obtain the median revenue per hour of the day over the whole timeframe (but this is not for every hour in the week), or resample
to obtain the median fare over time (but this is the median fare per trip, not the 'hourly revenue' of the taxis); both attempts are sketched below. Neither of these achieves exactly what I want, so how should I implement it?
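For reference, the two approaches mentioned above could look roughly like the sketch below (this assumes 'Trip End Timestamp' is the DatetimeIndex, as in the df.head() output; the exact calls may differ slightly in my notebook):

import pandas as pd

# Median fare per hour of the day over the whole timeframe
# (24 values in total, so not one value per hour of the week)
perHourOfDay = df.pivot_table(values='Trip Total',
                              index=df.index.hour,
                              aggfunc='median')

# Median fare per trip, resampled hourly
# (median of individual trip fares, not hourly revenue per taxi)
perHourOverTime = df['Trip Total'].resample('H').median()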
I was able to achieve this for my data by combining df.groupby
with the Grouper
feature in pandas. The data is first grouped by the Taxi ID; within each ID it is then resampled per timeframe (per hour, for example); and finally the fares in each hour are summed up for that particular ID, all in one command:
dfGrouped = (df.groupby(['Taxi ID',
                         pd.Grouper(freq='H', key='Trip End Timestamp')])
             ['Trip Total']
             .sum()                     # hourly revenue per taxi
             .unstack(fill_value=0))    # rows: Taxi ID, columns: hour

(Note that key= expects 'Trip End Timestamp' to be a regular column; if it is the DataFrame index, as in the df.head() output above, pass level='Trip End Timestamp' to pd.Grouper instead.)
Because of the final unstack operation, dfGrouped
now contains a 2D array of Taxi ID vs Hour. The median over all taxi IDs (rows) was simply obtained by:
dfGrouped.median()
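As a quick sanity check, here is a minimal self-contained sketch on a few made-up trips (the taxi IDs 'A'/'B' and the fares are purely illustrative), just to make the shape of the result concrete:

import pandas as pd

# Toy data: two taxis, six trips spread over two hours
dfDemo = pd.DataFrame({
    'Trip End Timestamp': pd.date_range('2016-04-25 18:00', periods=6, freq='20min'),
    'Taxi ID': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Trip Total': [10.0, 5.0, 7.5, 2.5, 4.0, 6.0],
})

grouped = (dfDemo.groupby(['Taxi ID',
                           pd.Grouper(freq='H', key='Trip End Timestamp')])
           ['Trip Total']
           .sum()                     # hourly revenue per taxi
           .unstack(fill_value=0))    # rows: Taxi ID, columns: hour

print(grouped)            # 2 taxis x 2 hourly totals
print(grouped.median())   # median hourly revenue over all taxis, per hour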