I have two pandas dataframes of different time series stamps, and I want to merge them together. The first dataframe is data from model and called df_model
and the second one is observation which is called df_obs
. Brief examples of df_model
and df_obs
are like below:
df_obs
hourtime sig
datetime_doy
2007-01-01 2007-01-01 21:04:26.250014 -9.202896
2007-01-02 2007-01-02 09:19:46.874995 -10.016614
2007-01-04 2007-01-04 10:18:28.125016 -9.721560
2007-01-04 2007-01-04 20:02:26.249995 -9.495712
2007-01-05 2007-01-05 21:21:50.623991 -9.599862
2007-01-06 2007-01-06 21:01:03.750006 -9.657872
2007-01-07 2007-01-07 09:16:24.374987 -9.834714
2007-01-08 2007-01-08 20:19:39.374988 -9.034375
2007-01-09 2007-01-09 10:15:05.624002 -9.256623
2007-01-09 2007-01-09 19:59:03.749987 -9.385364
then
df_model
date time WG2
2007-01-01 10:00:00 0.321475
2007-01-01 22:00:00 0.330627
2007-01-02 10:00:00 0.307581
2007-01-02 22:00:00 0.314543
2007-01-03 10:00:00 0.293417
2007-01-03 22:00:00 0.301927
2007-01-04 10:00:00 0.288573
2007-01-04 22:00:00 0.314799
2007-01-05 10:00:00 0.313870
2007-01-05 22:00:00 0.309083
2007-01-06 10:00:00 0.308628
2007-01-06 22:00:00 0.306380
2007-01-07 10:00:00 0.306606
2007-01-07 22:00:00 0.304614
2007-01-08 10:00:00 0.307809
2007-01-08 22:00:00 0.315039
2007-01-09 10:00:00 0.310244
2007-01-09 22:00:00 0.307171
2007-01-10 10:00:00 0.314156
2007-01-10 22:00:00 0.312657
So what I want to do is to merge these two dataframe by the following rules (listed as pseudo codes):
list_model = []
for i in range(len(df_obs)):
df_temp = df_model[df_model.date == df_obs.index[i]]
if len(df_temp) > 0:
if df_obs.iloc[i].hourtime.hour <= 12:
list_model.append(df_temp.iloc[0]['WG2']) # this means take the morning value of model data if hour of observation is smaller than 12.
elif df_obs.iloc[i].hourtime.hour > 12:
list_model.append(df_temp.iloc[1]['WG2']) # this means take the afternoon value of model data if hour of observation is greater than 12.
else:
list_model.append(np.nan)
df_obs['model'] = list_model
The final result should be like:
df_obs
hourtime sig model
datetime_doy
2007-01-01 2007-01-01 21:04:26.250014 -9.202896 0.330627
2007-01-02 2007-01-02 09:19:46.874995 -10.016614 0.307581
2007-01-04 2007-01-04 10:18:28.125016 -9.721560 0.288573
2007-01-04 2007-01-04 20:02:26.249995 -9.495712 0.314799
...
So basically, it means, take each value of observation, and see whether this observation is created on the morning or on the afternoon. If morning, we select the model data on the morning, and vice versa for afternoon. But the df_obs['hourtime']
is not strictly/exactly equal to df_model['time']
. So is there any efficient way to do this? I knew that my codes below could also work, but it takes a lot time if I got longer time series of model and observation data...
===========================
New doubts, can we fill the morning and evening data for each day? like this
df_obs
hourtime sig model
datetime_doy
2007-01-01 2007-01-01 10:00:00 nan 0.321475
2007-01-01 2007-01-01 21:04:26.250014 -9.202896 0.330627
2007-01-02 2007-01-02 09:19:46.874995 -10.016614 0.307581
2007-01-02 2007-01-02 22:00:00 nan 0.314543
2007-01-03 2007-01-03 10:00:00. nan 0.293417
2007-01-03 2007-01-03 22:00:00 nan 0.314543
2007-01-04 2007-01-04 10:18:28.125016 -9.721560 0.288573
2007-01-04 2007-01-04 20:02:26.249995 -9.495712 0.314799
Thanks!
===========================
Another new doubts: what if there exist some irregular data in df_obs
? I found that in my dataframe, there are two/multiple data on one same morning/afternoon...for example,
df_obs
hourtime sig
datetime_doy
2007-01-01 2007-01-01 21:02:41.249980 -10.927304
2007-01-03 2007-01-03 10:41:56.249987 -10.347592
2007-01-03 2007-01-03 22:01:20.624977 -10.744881
2007-01-05 2007-01-05 10:00:37.498999 -10.884786
2007-01-06 2007-01-06 09:39:52.499998 -11.205445
... ...
2019-12-31 2019-12-31 09:06:39.375021 -10.048244
2019-12-31 2019-12-31 09:41:50.625006 -10.131855
2019-12-31 2019-12-31 20:26:03.750011 -10.098373
2019-12-31 2019-12-31 21:01:09.375001 -10.155582
2020-01-01 2020-01-01 10:00:39.374989 -10.515008
You could see there are two morning data on 2019-12-31
. What if I want to get the average/mean data of those two data points if they are in the same morning/afternoon? what can I do? Thanks!
You can try this:
from datetime import time
df_obs['morning'] = df_obs.hourtime.astype('datetime64')\
.apply(lambda x: x.time() < time(12,00))
df_model['morning'] = df_model.time.astype('datetime64')\
.apply(lambda x: x.time() < time(12,00))
In this case the column is called morning
and has type bool
morning
) and transform your resulting dataframe as you wish (filter/remane columns, set index, etc.):df_obs.reset_index()\
.merge(df_model, left_on=['datetime_doy', 'morning'], right_on=['date', 'morning'])\
[['datetime_doy', 'hourtime', 'sig', 'WG2']]\
.set_index('datetime_doy')\
.rename(columns={'WG2': 'model'})
Explanation:
this is a single command but broken in line for better understanding.
row 1: reset index to get datetime_doy
as a column.
row 2: merge the two dataframes
row 3: filter only the columns that you need
row 4: set datetime_doy
again as index
row 5: rename the column WG2
to model
Result:
hourtime sig model
datetime_doy
2007-01-01 2007-01-01 21:04:26.250014 -9.202896 0.330627
2007-01-02 2007-01-02 09:19:46.874995 -10.016614 0.307581
2007-01-04 2007-01-04 10:18:28.125016 -9.721560 0.288573
2007-01-04 2007-01-04 20:02:26.249995 -9.495712 0.314799
2007-01-05 2007-01-05 21:21:50.623991 -9.599862 0.309083
2007-01-06 2007-01-06 21:01:03.750006 -9.657872 0.306380
2007-01-07 2007-01-07 09:16:24.374987 -9.834714 0.306606
2007-01-08 2007-01-08 20:19:39.374988 -9.034375 0.315039
2007-01-09 2007-01-09 10:15:05.624002 -9.256623 0.310244
2007-01-09 2007-01-09 19:59:03.749987 -9.385364 0.307171
Complete code:
from datetime import time
df_obs['morning'] = df_obs.hourtime.astype('datetime64').apply(lambda x: x.time() < time(12,00))
df_model['morning'] = df_model.time.astype('datetime64').apply(lambda x: x.time() < time(12,00))
df_obs.reset_index().merge(df_model, left_on=['datetime_doy', 'morning'], right_on=['date', 'morning'])[['datetime_doy', 'hourtime', 'sig', 'WG2']].set_index('datetime_doy').rename(columns={'WG2': 'model'})
To fill each morning and each afternoon you can change the order of the merge:
df_model
.merge(df_obs.reset_index(), right_on=['datetime_doy', 'morning'], left_on=['date', 'morning'], how='outer')
[['date', 'hourtime', 'sig', 'WG2']]
.set_index('date')
.rename(columns={'WG2': 'model'})
Output:
hourtime sig model
date
2007-01-01 NaN NaN 0.321475
2007-01-01 2007-01-01 21:04:26.250014 -9.202896 0.330627
2007-01-02 2007-01-02 09:19:46.874995 -10.016614 0.307581
2007-01-02 NaN NaN 0.314543
2007-01-03 NaN NaN 0.293417
2007-01-03 NaN NaN 0.301927
2007-01-04 2007-01-04 10:18:28.125016 -9.721560 0.288573
2007-01-04 2007-01-04 20:02:26.249995 -9.495712 0.314799
2007-01-05 NaN NaN 0.313870
2007-01-05 2007-01-05 21:21:50.623991 -9.599862 0.309083
2007-01-06 NaN NaN 0.308628
2007-01-06 2007-01-06 21:01:03.750006 -9.657872 0.306380
2007-01-07 2007-01-07 09:16:24.374987 -9.834714 0.306606
2007-01-07 NaN NaN 0.304614
2007-01-08 NaN NaN 0.307809
2007-01-08 2007-01-08 20:19:39.374988 -9.034375 0.315039
2007-01-09 2007-01-09 10:15:05.624002 -9.256623 0.310244
2007-01-09 2007-01-09 19:59:03.749987 -9.385364 0.307171
2007-01-10 NaN NaN 0.314156
2007-01-10 NaN NaN 0.312657