Search code examples
pythonpandastime-seriesmissing-datareindex

Pandas-Add missing years in time series data with duplicate years


I have a dataset like this where data for some years are missing .

County Year Pop
12     1999 1.1
12     2001 1.2
13     1999 1.0
13     2000 1.1

I want something like

County Year Pop
12     1999 1.1
12     2000 NaN
12     2001 1.2
13     1999 1.0
13     2000 1.1
13     2001 nan

I have tried setting index to year and then using reindex with another dataframe of just years method (mentioned here Pandas: Add data for missing months) but it gives me error cant reindex with duplicate values. I have also tried df.loc but it has same issue. I even tried a full outer join with blank df of just years but that also didnt work.

How can I solve this?


Solution

  • Make a MultiIndex so you don't have duplicates:

    df.set_index(['County', 'Year'], inplace=True)
    

    Then construct a full MultiIndex with all the combinations:

    index = pd.MultiIndex.from_product(df.index.levels)
    

    Then reindex:

    df.reindex(index)
    

    The construction of the MultiIndex is untested and may need a little tweaking (e.g. if a year is entirely absent from all counties), but I think you get the idea.