Search code examples
pythonpandasgroup-bytime-seriesinterpolation

I have time-series data for entry/exit counts for different train stations, how can I standardize the datetime frequency across the data?


The data that I currently have resembles something like the table below:

Station A operates from 12AM to 1AM, Station B operates from 1AM to 2AM and Station C operates from 2AM onwards.

Date Time Station Entries Exits
2023-01-01 00:00:00 Station A 1 1
2023-01-01 01:00:00 Station A 1 1
2023-01-01 01:00:00 Station B 2 2
2023-01-01 02:00:00 Station B 2 2
2023-01-01 02:00:00 Station C 3 3

I would like to add in additional rows so that my dataframe holds 'n' rows at each unique hour frequency, where n is the number of unique Stations. In this example it would mean 3 rows (for Stations A, B, C) for 12AM, 1AM and 2AM for a total of 9. What pandas functions can I use to convert the above table to a format similar to the one below?

Date Time Station Entries Exits
2023-01-01 00:00:00 Station A 1 1
2023-01-01 00:00:00 Station B 0 0
2023-01-01 00:00:00 Station C 0 0
2023-01-01 01:00:00 Station A 1 1
2023-01-01 01:00:00 Station B 2 2
2023-01-01 01:00:00 Station C 0 0
2023-01-01 02:00:00 Station A 0 0
2023-01-01 02:00:00 Station B 2 2
2023-01-01 02:00:00 Station C 3 3

Also including code for generating the initial dataframe below for ease of reference.

df = pd.DataFrame(
    {'Date Time':pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 01:00:00', 
                                                '2023-01-01 01:00:00',  '2023-01-01 02:00:00', 
                                                '2023-01-01 02:00:00']),
     'Station':['Station A', 'Station A', 'Station B', 'Station B', 'Station C'],
     'Entries':[1, 1, 2, 2, 3],
     'Exits':[1, 1, 2, 2, 3]}
)

I've tried using groupby and resampling with hourly frequency, but although this fills in the gaps for hours in between, it does not backfill (i.e. adding additional rows for Station C for the hours of 12 and 1 AM).


Solution

  • If you first perform a long-to-wide conversion using pivot, you can then use stack with dropna=False to create the additional rows:

    (df
     .pivot(index='Date Time', columns='Station')
     .stack(dropna=False)
     .fillna(0)
     .astype('int')
     .reset_index()
    )