Search code examples
pythonpandasdatetimegroup-byreindex

Looking to reindex dates based on a groupby of Grid Cell


I am attempting to do something similar to the reindexing in Pandas reindex date index by group revisited

except I have one extra level of indexing involved. My data is in the following structure: [Grid Cell, Site, Date, Value], and I want to reindex all sites within a particular grid cell to the same datetimeindex.

import pandas as pd
import numpy as np

data_dict = {'Grid Cell':[1,1,1,1,1,1,1,2,2,2,2,2,2,2],'Site':['A','A','A','A','B','B','B','C','C','C','D','D','D','D'],'Date':['1999-01-01','1999-02-01','1999-03-01','1999-04-01','1999-01-01','1999-02-01','1999-03-01','2000-01-01','2000-02-01','2000-03-01','2000-01-01','2000-02-01','2000-03-01','2000-04-01'],'Value':[-2.45,-3.72,1.34,4.56,0.23,3.26,6.76,-7.45,-6.43,-2.18,-10.72,-8.97,-5.32,-1.73]}
df = pd.DataFrame.from_dict(data_dict)

unique_dates = df.groupby('Grid Cell')['Date'].unique()

df = df.set_index('Date')
idx = pd.MultiIndex.from_product([df.index.unique(), df['Grid Cell'].unique()],names=['Date','Grid Cell'])

new_df = df.set_index("Grid Cell", append=True).reindex(idx, fill_value=np.nan).reset_index(level=1)

I am attempting to achieve a similar result to the linked example, such that all sites in Grid Cell 1, would be reindexed to the following datetimeindex: '1999-01-01', '1999-02-01', '1999-03-01', '1999-04-01', and all sites in Grid Cell 2 would be reindexed to: '2000-01-01', '2000-02-01', '2000-03-01', '2000-04-01', and the resultant dataframe would look like:

data_dict_out = {'Grid Cell':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],'Site':['A','A','A','A','B','B','B','B','C','C','C','D','D','D','D'],'Date':['1999-01-01','1999-02-01','1999-03-01','1999-04-01','1999-01-01','1999-02-01','1999-03-01','1999-04-01','2000-01-01','2000-02-01','2000-03-01','2000-04-01','2000-01-01','2000-02-01','2000-03-01','2000-04-01'],'Value':[-2.45,-3.72,1.34,4.56,0.23,3.26,6.76,np.nan,-7.45,-6.43,-2.18,np.nan,-10.72,-8.97,-5.32,-1.73]}

df_out = pd.DataFrame.from_dict(data_dict_out)

Unfortunately my partial solution currently produces a "ValueError: cannot handle a non-unique multi-index!" error. Any suggestions on what I need to change here?


Solution

  • IIUC, you can do:

    def reindex(g):
        idx = pd.MultiIndex.from_product(
            [g["Grid Cell"].unique(), g["Site"].unique(), g["Date"].unique()],
            names=["Grid Cell", "Site", "Date"],
        )
        return g.set_index(["Grid Cell", "Site", "Date"]).reindex(idx, fill_value=np.nan)
    
    
    df = df.groupby("Grid Cell", group_keys=False).apply(reindex).reset_index()
    print(df)
    

    Prints:

        Grid Cell Site        Date  Value
    0           1    A  1999-01-01  -2.45
    1           1    A  1999-02-01  -3.72
    2           1    A  1999-03-01   1.34
    3           1    A  1999-04-01   4.56
    4           1    B  1999-01-01   0.23
    5           1    B  1999-02-01   3.26
    6           1    B  1999-03-01   6.76
    7           1    B  1999-04-01    NaN
    8           2    C  2000-01-01  -7.45
    9           2    C  2000-02-01  -6.43
    10          2    C  2000-03-01  -2.18
    11          2    C  2000-04-01    NaN
    12          2    D  2000-01-01 -10.72
    13          2    D  2000-02-01  -8.97
    14          2    D  2000-03-01  -5.32
    15          2    D  2000-04-01  -1.73