I have a dataframe like this:
gpi_data[['sig','hourtime']]
Out[28]:
sig hourtime
datetime_doy
2007-01-02 -8.963545 2007-01-02 09:20:11.249998
2007-01-03 -8.671357 2007-01-03 10:39:31.874991
2007-01-03 -8.996480 2007-01-03 20:22:59.999006
2007-01-04 -8.835958 2007-01-04 10:18:56.249024
2007-01-05 -8.785034 2007-01-05 21:21:39.374002
... ...
2019-12-30 -8.529724 2019-12-30 20:23:01.874996
2019-12-30 -8.563781 2019-12-30 20:48:28.125016
2019-12-30 -8.504211 2019-12-30 21:23:44.999996
2019-12-31 -8.460620 2019-12-31 09:39:31.873999
2019-12-31 -8.230092 2019-12-31 10:18:58.125014
[7983 rows x 2 columns]
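For reference, here is a minimal reconstruction of a few of the sample rows above, in case someone wants to test with it (I'm assuming datetime_doy is a DatetimeIndex and hourtime is datetime64[ns]):

import pandas as pd

# Subset of the rows shown above (assumed dtypes: DatetimeIndex for
# datetime_doy, datetime64[ns] for hourtime)
gpi_data = pd.DataFrame(
    {'sig': [-8.963545, -8.671357, -8.996480, -8.835958, -8.785034,
             -8.529724, -8.563781, -8.504211, -8.460620, -8.230092],
     'hourtime': pd.to_datetime([
         '2007-01-02 09:20:11.249998', '2007-01-03 10:39:31.874991',
         '2007-01-03 20:22:59.999006', '2007-01-04 10:18:56.249024',
         '2007-01-05 21:21:39.374002', '2019-12-30 20:23:01.874996',
         '2019-12-30 20:48:28.125016', '2019-12-30 21:23:44.999996',
         '2019-12-31 09:39:31.873999', '2019-12-31 10:18:58.125014'])},
    index=pd.DatetimeIndex(pd.to_datetime(
        ['2007-01-02', '2007-01-03', '2007-01-03', '2007-01-04', '2007-01-05',
         '2019-12-30', '2019-12-30', '2019-12-30', '2019-12-31', '2019-12-31']),
        name='datetime_doy'))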
I want to calculate the averaged value for each morning and each evening based on hourtime. By morning I mean data observed around 10:00:00, and by evening data observed around 22:00:00. If there are no values for the morning/evening of a day, fill it with np.nan. For example, on 2007-01-01 we don't have any morning or evening values of sig, so we fill in two np.nan values. On 2007-01-02 we only have a morning value, so we fill the evening value of 2007-01-02 with np.nan.

Specifically, for 2019-12-30 we have 3 evening values, at 2019-12-30 20:23:01.874996, 2019-12-30 20:48:28.125016 and 2019-12-30 21:23:44.999996, so we need to calculate the average of -8.529724, -8.563781 and -8.504211. It's the same for the last two datapoints on the morning of 2019-12-31: we need to average them, and fill np.nan for the evening of 2019-12-31.
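As a quick sanity check on the numbers in the expected output below, those averages work out like this:

import numpy as np

# evening of 2019-12-30: mean of the three evening observations
np.mean([-8.529724, -8.563781, -8.504211])   # -8.532572
# morning of 2019-12-31: mean of the last two morning observations
np.mean([-8.460620, -8.230092])               # -8.345356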
So ideally the final result would be:
gpi_data[['sig','hourtime']]
Out[28]:
sig hourtime
datetime_doy
2007-01-01 nan 2007-01-01 10:00:00
2007-01-01 nan 2007-01-01 22:00:00
2007-01-02 -8.963545 2007-01-02 09:20:11.249998
2007-01-02 nan 2007-01-02 22:00:00
2007-01-03 -8.671357 2007-01-03 10:39:31.874991
2007-01-03 -8.996480 2007-01-03 20:22:59.999006
2007-01-04 -8.835958 2007-01-04 10:18:56.249024
2007-01-04 nan 2007-01-04 22:00:00
2007-01-05 nan 2007-01-05 10:00:00
2007-01-05 -8.785034 2007-01-05 21:21:39.374002
... ...
2019-12-30 -8.532572 2019-12-30 22:00:00
2019-12-31 -8.345356 2019-12-31 09:39:31.873999
2019-12-31 nan 2019-12-31 22:00:00
It's fine if we round all hourtime values to 10:00:00 or 22:00:00, like below:
gpi_data[['sig','hourtime']]
Out[28]:
sig hourtime
datetime_doy
2007-01-01 nan 2007-01-01 10:00:00
2007-01-01 nan 2007-01-01 22:00:00
2007-01-02 -8.963545 2007-01-02 10:00:00
2007-01-02 nan 2007-01-02 22:00:00
2007-01-03 -8.671357 2007-01-03 10:00:00
2007-01-03 -8.996480 2007-01-03 22:00:00
2007-01-04 -8.835958 2007-01-04 10:00:00
2007-01-04 nan 2007-01-04 22:00:00
2007-01-05 nan 2007-01-05 10:00:00
2007-01-05 -8.785034 2007-01-05 22:00:00
... ...
2019-12-30 -8.532572 2019-12-30 22:00:00
2019-12-31 -8.345356 2019-12-31 10:00:00
2019-12-31 nan 2019-12-31 22:00:00
How can I do it? Is there anybody who can help me? Thanks!
Use cut to bin hourtime into the 10 and 22 groups by some hour thresholds; here 12 and 23 hours are used. Then create a MultiIndex of all dates between the minimal and maximal years with MultiIndex.from_product, aggregate mean, add the missing combinations with Series.reindex, and last create the hourtime column:
import pandas as pd

# bin each observation by the hour of hourtime: (0, 12] -> 10 (morning), (12, 23] -> 22 (evening)
df['hourtime'] = pd.cut(df['hourtime'].dt.hour, bins=[0, 12, 23], labels=[10, 22])

# all dates from the first to the last year in the index, each with both bins
start = pd.Timestamp(year=df.index.year.min(), month=1, day=1)
end = pd.Timestamp(year=df.index.year.max(), month=12, day=31)
mux = pd.MultiIndex.from_product([pd.date_range(start, end), [10, 22]],
                                 names=['datetime_doy', 'h'])

# mean per day and bin, reindex to add missing day/bin combinations as NaN,
# then rebuild hourtime as the date plus 10 or 22 hours
df = df.groupby([df.index, 'hourtime'])['sig'].mean().reindex(mux).reset_index(level=1)
df['hourtime'] = df.index + pd.to_timedelta(df.pop('h'), unit='H')
print(df)
sig hourtime
datetime_doy
2007-01-01 NaN 2007-01-01 10:00:00
2007-01-01 NaN 2007-01-01 22:00:00
2007-01-02 -8.963545 2007-01-02 10:00:00
2007-01-02 NaN 2007-01-02 22:00:00
2007-01-03 -8.671357 2007-01-03 10:00:00
... ...
2019-12-29 NaN 2019-12-29 22:00:00
2019-12-30 NaN 2019-12-30 10:00:00
2019-12-30 -8.532572 2019-12-30 22:00:00
2019-12-31 -8.345356 2019-12-31 10:00:00
2019-12-31 NaN 2019-12-31 22:00:00
[9496 rows x 2 columns]
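If it helps to see what the cut step is doing, here is a small standalone check (not part of the solution above, just an illustration) of how the observation hours from the sample map to the two bins:

import pandas as pd

# hours taken from the sample hourtime values: 9 and 10 are mornings, 20 and 21 are evenings
print(pd.cut(pd.Series([9, 10, 20, 21]), bins=[0, 12, 23], labels=[10, 22]))
# 0    10
# 1    10
# 2    22
# 3    22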