python pandas pandas-groupby pandas-resample

Pandas: Resampling Hourly Data for each Group


I have a dataframe that contains GPS locations of vehicles received at various times during the day. For each vehicle, I want to resample the data hourly so that I have the median report (according to the timestamp) for each hour of the day. For hours with no corresponding rows, I want a blank row. I am using the following code:

idx = range(24)  # every hour of the day, 0 to 23
for j in df.id.unique():
        data = df.loc[df.id == j].copy()  # copy so the new column is not set on a view
        data['hour'] = data['timestamp'].dt.hour
        data_grouped = data.groupby(['imo', 'hour']).median().reset_index()
        # reindex so that hours with no reports become blank (NaN) rows
        data = data_grouped.set_index('hour').reindex(idx).reset_index()


Since my dataframe has millions of ids, iterating through all of them takes a long time. Is there a more efficient way of doing this?

Unlike in "Pandas reindex dates in Groupby", I have multiple rows for each hour, in addition to some hours having no rows at all.


Solution

  • Tested in the latest version of pandas: convert the hour column to a categorical with all 24 possible categories, then aggregate without a loop:

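    df['hour'] = pd.Categorical(df['timestamp'].dt.hour, categories=range(24))
    # observed=False keeps the unused hour categories, so hours with no reports come back as NaN rows
    df1 = df.groupby(['id','imo','hour'], observed=False).median().reset_index()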
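To see the idea on a small case, here is a minimal sketch with made-up data. The `lat` and `lon` columns and all the values are assumptions for illustration only; the question does not show the real columns. It groups by `id` and the categorical `hour` only, because with `observed=False` pandas appears to expand the result to every combination of the grouping keys, so extra keys such as `imo` can multiply the number of rows.

```python
import pandas as pd

# Hypothetical sample: two vehicles with a few GPS reports each.
df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2021-01-01 00:10", "2021-01-01 00:50",
        "2021-01-01 02:30", "2021-01-01 05:00",
    ]),
    "lat": [10.0, 12.0, 20.0, 30.0],
    "lon": [50.0, 52.0, 60.0, 70.0],
})

# Declare all 24 hours as categories, including those with no reports.
df["hour"] = pd.Categorical(df["timestamp"].dt.hour, categories=range(24))

# observed=False asks groupby to keep the unobserved hour categories,
# so every id gets 24 rows; hours without data hold NaN.
hourly = (
    df.groupby(["id", "hour"], observed=False)[["lat", "lon"]]
      .median()
      .reset_index()
)
```

Here `hourly` has 24 rows per vehicle. Passing `observed=False` explicitly seems safer than relying on the default, since newer pandas versions have been changing that default.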