
Pandas: calculate morning and afternoon averaged values


I have a DataFrame like this:

gpi_data[['sig','hourtime']]
Out[28]: 
                   sig                   hourtime
datetime_doy                                     
2007-01-02   -8.963545 2007-01-02 09:20:11.249998
2007-01-03   -8.671357 2007-01-03 10:39:31.874991
2007-01-03   -8.996480 2007-01-03 20:22:59.999006
2007-01-04   -8.835958 2007-01-04 10:18:56.249024
2007-01-05   -8.785034 2007-01-05 21:21:39.374002
               ...                        ...
2019-12-30   -8.529724 2019-12-30 20:23:01.874996
2019-12-30   -8.563781 2019-12-30 20:48:28.125016
2019-12-30   -8.504211 2019-12-30 21:23:44.999996
2019-12-31   -8.460620 2019-12-31 09:39:31.873999
2019-12-31   -8.230092 2019-12-31 10:18:58.125014

[7983 rows x 2 columns]

I want to calculate the average value for each morning and each afternoon based on hourtime. By morning I mean data observed around 10:00:00; by afternoon (called evening below) I mean data observed around 22:00:00. If a day has no morning or evening values, the missing slot should be filled with np.nan. For example, 2007-01-01 has no morning or evening values of sig, so both slots are filled with np.nan. 2007-01-02 only has a morning value, so its evening value is np.nan. In particular, 2019-12-30 has three evening values, at 2019-12-30 20:23:01.874996, 2019-12-30 20:48:28.125016 and 2019-12-30 21:23:44.999996, so I need the average of -8.529724, -8.563781 and -8.504211. The same goes for the last two data points, on the morning of 2019-12-31: they should be averaged, and the evening of 2019-12-31 filled with np.nan.

So ideally the final result would be:

gpi_data[['sig','hourtime']]
Out[28]: 
                   sig                   hourtime
datetime_doy                                     
2007-01-01   nan       2007-01-01 10:00:00
2007-01-01   nan       2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 09:20:11.249998
2007-01-02   nan       2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:39:31.874991
2007-01-03   -8.996480 2007-01-03 20:22:59.999006
2007-01-04   -8.835958 2007-01-04 10:18:56.249024
2007-01-04   nan       2007-01-04 22:00:00
2007-01-05   nan       2007-01-05 10:00:00
2007-01-05   -8.785034 2007-01-05 21:21:39.374002
               ...                        ...
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 09:39:31.873999
2019-12-31   nan       2019-12-31 22:00:00

It's also fine to round every hourtime to 10:00:00 or 22:00:00, like below:

gpi_data[['sig','hourtime']]
Out[28]: 
                   sig                   hourtime
datetime_doy                                     
2007-01-01   nan       2007-01-01 10:00:00
2007-01-01   nan       2007-01-01 22:00:00
2007-01-02   -8.963545 2007-01-02 10:00:00
2007-01-02   nan       2007-01-02 22:00:00
2007-01-03   -8.671357 2007-01-03 10:00:00
2007-01-03   -8.996480 2007-01-03 22:00:00
2007-01-04   -8.835958 2007-01-04 10:00:00
2007-01-04   nan       2007-01-04 22:00:00
2007-01-05   nan       2007-01-05 10:00:00
2007-01-05   -8.785034 2007-01-05 22:00:00
               ...                        ...
2019-12-30   -8.532572 2019-12-30 22:00:00
2019-12-31   -8.345356 2019-12-31 10:00:00
2019-12-31   nan       2019-12-31 22:00:00

How can I do this? Can anybody help me? Thanks!


Solution

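  • Use cut to label each observation 10 (morning) or 22 (evening) by binning the hour of hourtime; the bin edges used here are 0, 12 and 23.

    Then build a MultiIndex with MultiIndex.from_product that covers every day from 1 January of the earliest year to 31 December of the latest year, crossed with the two labels. Aggregate with mean, add the missing combinations with Series.reindex, and finally rebuild the hourtime column: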

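    import pandas as pd

    # Label each row 10 (hours 0-12) or 22 (hours 13-23); include_lowest=True
    # keeps hour 0 in the first bin instead of turning it into NaN.
    df['hourtime'] = pd.cut(df['hourtime'].dt.hour, bins=[0,12,23],
                            labels=[10,22], include_lowest=True)
    
    # Every calendar day from the first to the last year, crossed with both labels.
    start = pd.Timestamp(year=df.index.year.min(), month=1, day=1)
    end = pd.Timestamp(year=df.index.year.max(), month=12, day=31)
    mux = pd.MultiIndex.from_product([pd.date_range(start, end), [10,22]],
                                      names=['datetime_doy','h'])
    
    # Mean per (day, label); reindex adds the missing combinations as NaN.
    df = df.groupby([df.index, 'hourtime'])['sig'].mean().reindex(mux).reset_index(level=1)
    # Rebuild hourtime as day + 10 h or day + 22 h (unit 'H' is deprecated in pandas >= 2.2).
    df['hourtime'] = df.index + pd.to_timedelta(df.pop('h'), unit='h')
    print(df)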
                       sig            hourtime
    datetime_doy                              
    2007-01-01         NaN 2007-01-01 10:00:00
    2007-01-01         NaN 2007-01-01 22:00:00
    2007-01-02   -8.963545 2007-01-02 10:00:00
    2007-01-02         NaN 2007-01-02 22:00:00
    2007-01-03   -8.671357 2007-01-03 10:00:00
                   ...                 ...
    2019-12-29         NaN 2019-12-29 22:00:00
    2019-12-30         NaN 2019-12-30 10:00:00
    2019-12-30   -8.532572 2019-12-30 22:00:00
    2019-12-31   -8.345356 2019-12-31 10:00:00
    2019-12-31         NaN 2019-12-31 22:00:00
    
    [9496 rows x 2 columns]
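
For anyone who wants to try the idea without the original data, here is a minimal self-contained sketch of the same steps on a tiny made-up frame. The sample values, the helper name half_day_means and the choice to span only the observed days (rather than whole calendar years) are assumptions for illustration, not part of the question or the answer. It also swaps pd.cut for a plain np.where split at noon, which sends hour 12 to the evening slot.

```python
import numpy as np
import pandas as pd


def half_day_means(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of `sig` per day and half-day slot, NaN where a slot is empty.

    Hypothetical helper: hours 0-11 are labelled 10, hours 12-23 are labelled 22.
    """
    tmp = pd.DataFrame({
        "day": df.index.normalize(),
        "h": np.where(df["hourtime"].dt.hour < 12, 10, 22),
        "sig": df["sig"].to_numpy(),
    })
    means = tmp.groupby(["day", "h"])["sig"].mean()

    # All (day, slot) combinations between the first and last observed day.
    days = pd.date_range(tmp["day"].min(), tmp["day"].max(), freq="D")
    full = pd.MultiIndex.from_product([days, [10, 22]],
                                      names=["datetime_doy", "h"])

    out = means.reindex(full).reset_index(level="h")
    # Rebuild hourtime as midnight of the day plus 10 or 22 hours.
    hours = pd.to_timedelta(out.pop("h").to_numpy(), unit="h")
    out["hourtime"] = out.index + hours
    return out


# Made-up sample: two evening readings on 2007-01-03, nothing on 2007-01-04.
times = pd.to_datetime([
    "2007-01-02 09:20:11",
    "2007-01-03 10:39:31",
    "2007-01-03 20:22:59",
    "2007-01-03 21:23:44",
    "2007-01-05 21:21:39",
])
df = pd.DataFrame({"sig": [-9.0, -8.5, -8.0, -9.0, -8.75], "hourtime": times})
df = df.set_index(df["hourtime"].dt.normalize().rename("datetime_doy"))

result = half_day_means(df)
print(result)
```

The two evening readings on 2007-01-03 collapse into a single averaged row, and every slot with no observations, including both slots of 2007-01-04, comes back as NaN.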