Search code examples
pythonpandasdataframepanel

Balancing a panel data for regression


I have a dataframe:

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'] ,"year": [2008, 2009, 2010, 2008, 2010,2009], "value": [10,20,30,10,20,30]})

    id  city     year  value
0   1    abc    2008    10 
1   1    abc    2009    20
2   1    abc    2010    30
3   2   def10   2008    10
4   2   def10   2010    20
5   3   ghk     2009    30

I wanna create a balanced data such that:

    id  city     year  value
0   1    abc    2008    10 
1   1    abc    2009    20
2   1    abc    2010    30
3   2   def10   2008    10
4   2   def10   2009    NaN
5   2   def10   2010    20
6   3   ghk     2008    NaN
7   3   ghk     2009    30
8   3   ghk     2009    NaN

if I use the following code:

df = df.set_index('id')
balanced = (id.set_index('year',append=True).reindex(pd.MultiIndex.from_product([df.index,range(df.year.min(),df.year.max()+1)],names=['frs_id','year'])).reset_index(level=1))

This gives me following error:

cannot handle a non-unique multi-index!

Solution

  • Pivot the table and stack year without drop NaN values:

    >>> df.pivot(["id", "city"], "year", "value") \
          .stack(dropna=False) \
          .rename("value") \
          .reset_index()
    
       id   city  year  value
    0   1    abc  2008   10.0
    1   1    abc  2009   20.0
    2   1    abc  2010   30.0
    3   2  def10  2008   10.0
    4   2  def10  2009    NaN
    5   2  def10  2010   20.0
    6   3    ghk  2008    NaN
    7   3    ghk  2009   30.0
    8   3    ghk  2010    NaN
    

    Edit: case of duplicate entries

    I slightly modified your original dataframe:

    df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})
    
    >>> df
       id   city  year  value
    0   1    abc  2008     10
    1   1    abc  2009     20
    2   1    abc  2010     30
    3   2  def10  2008     10
    4   2  def10  2010     20
    5   3    ghk  2009     30  # The problem is here
    6   3    ghk  2009     40  # same (id, city, year)
    

    You need to take a decision. Do you want to keep the row 5 or 6 or apply a math function (mean, sum, ...). Imagine you want the mean for (3, ghk, 2009):

    >>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
          .stack(dropna=False) \
          .rename("value") \
          .reset_index()
    
       id   city  year  value
    0   1    abc  2008   10.0
    1   1    abc  2009   20.0
    2   1    abc  2010   30.0
    3   2  def10  2008   10.0
    4   2  def10  2009    NaN
    5   2  def10  2010   20.0
    6   3    ghk  2008    NaN
    7   3    ghk  2009   35.0  # <- mean of (30, 40)
    8   3    ghk  2010    NaN