Pandas: Compute sum of values for each unique element ('ID') in a given timeframe and then compute the median across all 'ID's


I am working on a dataset of taxi rides in the city of Chicago. Each trip has fields such as Taxi ID, timestamp and fare, among others. A sample from df.head() is shown below, with the timestamps as pandas datetimes:

                                                               Taxi ID  Pickup Community Area  Trip Total
Trip End Timestamp
2016-04-25 18:00:00  bfec2c659fcbed3b508dc6caaf6f7ef39e17801e8ca0cc...                    8.0       16.00
2016-11-16 15:45:00  f45c4bfa9d6c445eb03ab69093f7ec1e9cdecef83cb2ce...                   32.0        9.75
2016-01-15 17:45:00  ecfb6f2cdce5d4c4e80218f58070ae719060ee47e648f4...                   32.0       14.75
2016-01-20 21:30:00  7d8179131ea9952793af4cda8635e94b56c2b92d3c376c...                   32.0        7.25
2016-02-11 09:15:00  d5c4fbae1c0c510364404a90fd477b19f7f03408ce40ff...                   35.0       18.75

Assuming I have data for one week, I want to achieve this:

  1. Choose a bin width, e.g. one hour.
  2. Group the data by taxi ID, and sum up the amount made by each taxi within that one hour.
  3. Compute the median of these 'hourly revenues' across all taxis (still within that one hour).
  4. Repeat for each hour in the timeframe. For a week, this is 24*7 = 168 times.
  5. Plot the median per hour across all hours in the week.

This can be extended to hours in a week, weeks in a year etc.

As of now, I am able to use pivot_table to obtain, say, the median revenue per hour of the day over the whole timeframe (but this is not for every hour in the week), or resample to obtain the median fare over time (but this is the median fare per trip, not the 'hourly revenue' of the taxis). Neither of those achieves exactly what I want. How should I implement this?


Solution

  • I was able to achieve this for my data by combining df.groupby with the Grouper feature in pandas. The data is first grouped by taxi ID; within each ID it is binned per timeframe (per hour, for example), and the fares for that ID are then summed within each bin, all in one command. This assumes 'Trip End Timestamp' is a regular column; if it is the index, as in the df.head() output above, df.reset_index() makes it a column first.

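    import pandas as pd

    # Sum 'Trip Total' per taxi and per hourly bin, then pivot the hours into columns.
    dfGrouped = (df.groupby(['Taxi ID', pd.Grouper(freq='H', key='Trip End Timestamp')])
         ['Trip Total']
         .sum()
         .unstack(fill_value=0))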
    

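    Because of the final unstack operation, dfGrouped is a DataFrame with one row per Taxi ID and one column per hour. With fill_value=0, a taxi with no trips in a given hour counts as zero revenue for that hour. The median over all taxi IDs (rows) for each hour is then simply: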

    dfGrouped.median()
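
For illustration, here is a minimal end-to-end sketch of the same approach on a tiny made-up table. The taxi IDs and fares are hypothetical, not taken from the Chicago data, and the names per_taxi, median_active and median_all are only for this example. It also shows how the choice of fill value changes the result: leaving empty taxi/hour pairs as NaN gives the median over taxis that actually had trips in that hour, while filling with 0 counts idle taxis as zero revenue.

```python
import pandas as pd

# Toy data (hypothetical values, not from the Chicago dataset).
df = pd.DataFrame(
    {
        "Trip End Timestamp": pd.to_datetime(
            [
                "2016-04-25 18:00", "2016-04-25 18:30", "2016-04-25 18:45",
                "2016-04-25 19:15", "2016-04-25 19:30",
            ]
        ),
        "Taxi ID": ["a", "a", "b", "b", "c"],
        "Trip Total": [10.0, 5.0, 20.0, 8.0, 4.0],
    }
)

# Sum revenue per taxi per hourly bin; rows are taxis, columns are hours.
# "60min" is used here as a spelling of the hourly frequency.
per_taxi = (
    df.groupby(["Taxi ID", pd.Grouper(key="Trip End Timestamp", freq="60min")])["Trip Total"]
    .sum()
    .unstack()  # no fill_value: taxi/hour pairs without trips stay NaN
)

# Median over taxis (rows) for each hour (column).
median_active = per_taxi.median()          # NaN skipped: only taxis with trips that hour
median_all = per_taxi.fillna(0).median()   # idle taxis counted as 0 revenue

print(median_active)
print(median_all)
```

Either result is a Series indexed by hour, so calling .plot() on it (with matplotlib installed) should give the median-per-hour curve asked for in step 5.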