Search code examples
pandasdataframemergetime-seriesreindex

Pandas: Reindex/merge or re-allocate time series (of different time stamps) without interpolation


I have two pandas dataframes of different time series stamps, and I want to merge them together. The first dataframe is data from model and called df_model and the second one is observation which is called df_obs. Brief examples of df_model and df_obs are like below:

df_obs
                               hourtime        sig
datetime_doy                                      
2007-01-01   2007-01-01 21:04:26.250014  -9.202896
2007-01-02   2007-01-02 09:19:46.874995 -10.016614
2007-01-04   2007-01-04 10:18:28.125016  -9.721560
2007-01-04   2007-01-04 20:02:26.249995  -9.495712
2007-01-05   2007-01-05 21:21:50.623991  -9.599862
2007-01-06   2007-01-06 21:01:03.750006  -9.657872
2007-01-07   2007-01-07 09:16:24.374987  -9.834714
2007-01-08   2007-01-08 20:19:39.374988  -9.034375
2007-01-09   2007-01-09 10:15:05.624002  -9.256623
2007-01-09   2007-01-09 19:59:03.749987  -9.385364

then

df_model

date          time        WG2
2007-01-01    10:00:00    0.321475
2007-01-01    22:00:00    0.330627
2007-01-02    10:00:00    0.307581
2007-01-02    22:00:00    0.314543
2007-01-03    10:00:00    0.293417
2007-01-03    22:00:00    0.301927
2007-01-04    10:00:00    0.288573
2007-01-04    22:00:00    0.314799
2007-01-05    10:00:00    0.313870
2007-01-05    22:00:00    0.309083
2007-01-06    10:00:00    0.308628
2007-01-06    22:00:00    0.306380
2007-01-07    10:00:00    0.306606
2007-01-07    22:00:00    0.304614
2007-01-08    10:00:00    0.307809
2007-01-08    22:00:00    0.315039
2007-01-09    10:00:00    0.310244
2007-01-09    22:00:00    0.307171
2007-01-10    10:00:00    0.314156
2007-01-10    22:00:00    0.312657

So what I want to do is to merge these two dataframe by the following rules (listed as pseudo codes):

list_model = []
for i in range(len(df_obs)):
    df_temp = df_model[df_model.date == df_obs.index[i]]
    if len(df_temp) > 0:
        if df_obs.iloc[i].hourtime.hour <= 12:
            list_model.append(df_temp.iloc[0]['WG2']) # this means take the morning value of model data if hour of observation is smaller than 12.
        elif df_obs.iloc[i].hourtime.hour > 12:
            list_model.append(df_temp.iloc[1]['WG2']) # this means take the afternoon value of model data if hour of observation is greater than 12.
    else:
        list_model.append(np.nan)
df_obs['model'] = list_model

The final result should be like:

df_obs
                               hourtime        sig        model
datetime_doy                                      
2007-01-01   2007-01-01 21:04:26.250014  -9.202896   0.330627
2007-01-02   2007-01-02 09:19:46.874995 -10.016614   0.307581
2007-01-04   2007-01-04 10:18:28.125016  -9.721560   0.288573
2007-01-04   2007-01-04 20:02:26.249995  -9.495712   0.314799
...

So basically, it means, take each value of observation, and see whether this observation is created on the morning or on the afternoon. If morning, we select the model data on the morning, and vice versa for afternoon. But the df_obs['hourtime'] is not strictly/exactly equal to df_model['time']. So is there any efficient way to do this? I knew that my codes below could also work, but it takes a lot time if I got longer time series of model and observation data...

===========================

New doubts, can we fill the morning and evening data for each day? like this

df_obs
                               hourtime        sig        model
datetime_doy 
2007-01-01  2007-01-01 10:00:00               nan         0.321475
2007-01-01  2007-01-01 21:04:26.250014  -9.202896         0.330627
2007-01-02  2007-01-02 09:19:46.874995  -10.016614        0.307581
2007-01-02  2007-01-02 22:00:00               nan         0.314543
2007-01-03  2007-01-03 10:00:00.              nan         0.293417
2007-01-03  2007-01-03 22:00:00               nan         0.314543
2007-01-04  2007-01-04 10:18:28.125016  -9.721560         0.288573
2007-01-04  2007-01-04 20:02:26.249995  -9.495712         0.314799

Thanks!

===========================

Another new doubts: what if there exist some irregular data in df_obs? I found that in my dataframe, there are two/multiple data on one same morning/afternoon...for example,

df_obs
                               hourtime        sig
datetime_doy                                      
2007-01-01   2007-01-01 21:02:41.249980 -10.927304
2007-01-03   2007-01-03 10:41:56.249987 -10.347592
2007-01-03   2007-01-03 22:01:20.624977 -10.744881
2007-01-05   2007-01-05 10:00:37.498999 -10.884786
2007-01-06   2007-01-06 09:39:52.499998 -11.205445
                                ...        ...
2019-12-31   2019-12-31 09:06:39.375021 -10.048244
2019-12-31   2019-12-31 09:41:50.625006 -10.131855
2019-12-31   2019-12-31 20:26:03.750011 -10.098373
2019-12-31   2019-12-31 21:01:09.375001 -10.155582
2020-01-01   2020-01-01 10:00:39.374989 -10.515008

You could see there are two morning data on 2019-12-31. What if I want to get the average/mean data of those two data points if they are in the same morning/afternoon? what can I do? Thanks!


Solution

  • You can try this:

    1. Add a columnt in both dataframes (or two temporal dataframes) showing if it morning or afternoon:
    from datetime import time
    
    df_obs['morning'] = df_obs.hourtime.astype('datetime64')\
        .apply(lambda x: x.time() < time(12,00))
    
    df_model['morning'] = df_model.time.astype('datetime64')\
        .apply(lambda x: x.time() < time(12,00))
    

    In this case the column is called morning and has type bool

    1. Merge both dataframes on date and your new column (morning) and transform your resulting dataframe as you wish (filter/remane columns, set index, etc.):
    df_obs.reset_index()\
        .merge(df_model, left_on=['datetime_doy', 'morning'], right_on=['date', 'morning'])\
        [['datetime_doy', 'hourtime', 'sig', 'WG2']]\
        .set_index('datetime_doy')\
        .rename(columns={'WG2': 'model'})
    

    Explanation:

    this is a single command but broken in line for better understanding.

    row 1: reset index to get datetime_doy as a column.

    row 2: merge the two dataframes

    row 3: filter only the columns that you need

    row 4: set datetime_doy again as index

    row 5: rename the column WG2to model

    Result:

                hourtime                    sig         model
    datetime_doy            
    2007-01-01  2007-01-01 21:04:26.250014  -9.202896   0.330627
    2007-01-02  2007-01-02 09:19:46.874995  -10.016614  0.307581
    2007-01-04  2007-01-04 10:18:28.125016  -9.721560   0.288573
    2007-01-04  2007-01-04 20:02:26.249995  -9.495712   0.314799
    2007-01-05  2007-01-05 21:21:50.623991  -9.599862   0.309083
    2007-01-06  2007-01-06 21:01:03.750006  -9.657872   0.306380
    2007-01-07  2007-01-07 09:16:24.374987  -9.834714   0.306606
    2007-01-08  2007-01-08 20:19:39.374988  -9.034375   0.315039
    2007-01-09  2007-01-09 10:15:05.624002  -9.256623   0.310244
    2007-01-09  2007-01-09 19:59:03.749987  -9.385364   0.307171
    

    Complete code:

    from datetime import time
    df_obs['morning'] = df_obs.hourtime.astype('datetime64').apply(lambda x: x.time() < time(12,00))
    df_model['morning'] = df_model.time.astype('datetime64').apply(lambda x: x.time() < time(12,00))
    df_obs.reset_index().merge(df_model, left_on=['datetime_doy', 'morning'], right_on=['date', 'morning'])[['datetime_doy', 'hourtime', 'sig', 'WG2']].set_index('datetime_doy').rename(columns={'WG2': 'model'})
    

    To fill each morning and each afternoon you can change the order of the merge:

    df_model
        .merge(df_obs.reset_index(), right_on=['datetime_doy', 'morning'], left_on=['date', 'morning'], how='outer')
        [['date', 'hourtime', 'sig', 'WG2']]
        .set_index('date')
        .rename(columns={'WG2': 'model'})
    

    Output:

                                  hourtime        sig     model
    date                                                       
    2007-01-01                         NaN        NaN  0.321475
    2007-01-01  2007-01-01 21:04:26.250014  -9.202896  0.330627
    2007-01-02  2007-01-02 09:19:46.874995 -10.016614  0.307581
    2007-01-02                         NaN        NaN  0.314543
    2007-01-03                         NaN        NaN  0.293417
    2007-01-03                         NaN        NaN  0.301927
    2007-01-04  2007-01-04 10:18:28.125016  -9.721560  0.288573
    2007-01-04  2007-01-04 20:02:26.249995  -9.495712  0.314799
    2007-01-05                         NaN        NaN  0.313870
    2007-01-05  2007-01-05 21:21:50.623991  -9.599862  0.309083
    2007-01-06                         NaN        NaN  0.308628
    2007-01-06  2007-01-06 21:01:03.750006  -9.657872  0.306380
    2007-01-07  2007-01-07 09:16:24.374987  -9.834714  0.306606
    2007-01-07                         NaN        NaN  0.304614
    2007-01-08                         NaN        NaN  0.307809
    2007-01-08  2007-01-08 20:19:39.374988  -9.034375  0.315039
    2007-01-09  2007-01-09 10:15:05.624002  -9.256623  0.310244
    2007-01-09  2007-01-09 19:59:03.749987  -9.385364  0.307171
    2007-01-10                         NaN        NaN  0.314156
    2007-01-10                         NaN        NaN  0.312657