Search code examples
pandasmulti-indexreindex

Pandas: Enforcing consistent values for inner index across all outer index values


I have a dataset indexed by entity_id and timestamp, but certain entity_id's do not have entries at all timestamps (not missing values, just no row). I'm trying to enforce consistent timestamps across the entity_ids prior to some complicated NaN handling and resampling. But, I cannot get reindex to create the rows I was expecting, and it is leading to unexpected behavior downstream. My approach was:

import numpy as np
import pandas as pd
df = pd.DataFrame(columns = ["id", "ts", "value"])
df.loc[0,:] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
df.loc[1,:] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
df.loc[2,:] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]
df = df.set_index(["id", "ts"])
df

# Grab all the timestamps
timestamps = df.index.get_level_values("ts").unique().sort_values()

# Perform the reindexing
df2 = df.reindex(timestamps, level = 1, axis = 0, fill_value = np.nan)

However, this leaves my dataframe unchanged, i.e., df2 still only has 3 rows. Maybe reindexing isn't the right approach here, but I thought it would work.

Is there a best practice for this sort of operation?

Thank you!


Solution

  • Use:

    #added sample data
    df = pd.DataFrame(columns = ["id", "ts", "value"])
    df.loc[0,:] = [1, pd.Timestamp("2022-01-01 00:00:00"), 1]
    df.loc[1,:] = [1, pd.Timestamp("2022-01-01 00:00:01"), 2]
    df.loc[2,:] = [2, pd.Timestamp("2022-01-01 00:00:00"), 3]
    df.loc[3,:] = [3, pd.Timestamp("2022-01-01 00:00:04"), 4]
    df = df.set_index(["id", "ts"])
    print (df)
                           value
    id ts                       
    1  2022-01-01 00:00:00     1
       2022-01-01 00:00:01     2
    2  2022-01-01 00:00:00     3
    3  2022-01-01 00:00:04     4
    

    If need add missing consecutive datetimes by date_range with minimal and maximal values use MultiIndex.from_product with all ids and dates and pass to DataFrame.reindex:

    dates = pd.date_range(df.index.levels[1].min(), df.index.levels[1].max(), freq='S')
    
    mux = pd.MultiIndex.from_product([df.index.levels[0], dates], names=df.index.names)
    
    out1 = df.reindex(mux)
    print (out1)
                           value
    id ts                       
    1  2022-01-01 00:00:00     1
       2022-01-01 00:00:01     2
       2022-01-01 00:00:02   NaN
       2022-01-01 00:00:03   NaN
       2022-01-01 00:00:04   NaN
    2  2022-01-01 00:00:00     3
       2022-01-01 00:00:01   NaN
       2022-01-01 00:00:02   NaN
       2022-01-01 00:00:03   NaN
       2022-01-01 00:00:04   NaN
    3  2022-01-01 00:00:00   NaN
       2022-01-01 00:00:01   NaN
       2022-01-01 00:00:02   NaN
       2022-01-01 00:00:03   NaN
       2022-01-01 00:00:04     4
    

    If need DataFrame.reindex by unique values of both levels of MultiIndex:

    mux = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
    out2 = df.reindex(mux)
    print (out2)
                           value
    id ts                       
    1  2022-01-01 00:00:00     1
       2022-01-01 00:00:01     2
       2022-01-01 00:00:04   NaN
    2  2022-01-01 00:00:00     3
       2022-01-01 00:00:01   NaN
       2022-01-01 00:00:04   NaN
    3  2022-01-01 00:00:00   NaN
       2022-01-01 00:00:01   NaN
       2022-01-01 00:00:04     4