Search code examples
pythonpandasdatetime-seriesforecasting

Python Pandas: resample based on just one of the columns


I have the following data and I'm resampling my data to find out how many bikes arrive at each of the stations every 15 minutes. However, my code is aggregating my stations too, and I only want to aggregate the variable "dtm_end_trip"

Sample data:

id_trip dtm_start_trip dtm_end_trip start_station end_station
1 2018-10-01 10:15:00 2018-10-01 10:17:00 A B
2 2018-10-01 10:17:00 2018-10-01 10:18:00 B A
... ... ... ... ...
999999 2021-12-31 23:58:00 2022-01-01 00:22:00 C A
1000000 2021-12-31 23:59:00 2022-01-01 00:29:00 A D

Trial code:

df2 =  df(['end_station', 'dtm_end_trip']).size().to_frame(name = 'count').reset_index()
df2 = df2.sort_values(by='count', ascending=False)

df2= df2.set_index('dtm_end_trip')

df2 = df2.resample('15T').count()

Output I get:

dtm_end_trip end_station count
2018-10-01 00:15:00 2 2
2018-10-01 00:30:00 0 0
2018-10-01 00:45:00 1 1
2018-10-01 01:00:00 2 2
2018-10-01 01:15:00 1 1

Desired output:

dtm_end_trip end_station count
2018-10-01 00:15:00 A 2
2018-10-01 00:15:00 B 0
2018-10-01 00:15:00 C 1
2018-10-01 00:15:00 D 2
2018-10-01 00:30:00 A 3
2018-10-01 00:30:00 B 2

The count column of the table above was, in this case, constructed with random numbers with the sole purpose of exemplifying the architecture of the desired output.


Solution

  • You can use pd.Grouper like this:

    out = df.groupby([
        pd.Grouper(freq='15min', key='dtm_end_trip'),
        'end_station',
    ]).size()
    
    >>> out
    dtm_end_trip         end_station
    2018-10-01 10:15:00  A              1
                         B              1
    2022-01-01 00:15:00  A              1
                         D              1
    dtype: int64
    

    The result is a Series, but you can easily convert it to a DataFrame with the same headings as per your desired output:

    >>> out.to_frame('count').reset_index()
             dtm_end_trip end_station  count
    0 2018-10-01 10:15:00           A      1
    1 2018-10-01 10:15:00           B      1
    2 2022-01-01 00:15:00           A      1
    3 2022-01-01 00:15:00           D      1
    

    Note: this is the result from the four rows in your sample input data.