Balancing a panel data for regression

I have a dataframe:

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'] ,"year": [2008, 2009, 2010, 2008, 2010,2009], "value": [10,20,30,10,20,30]})

    id  city     year  value
0   1    abc    2008    10 
1   1    abc    2009    20
2   1    abc    2010    30
3   2   def10   2008    10
4   2   def10   2010    20
5   3   ghk     2009    30

I wanna create a balanced data such that:

    id  city     year  value
0   1    abc    2008    10 
1   1    abc    2009    20
2   1    abc    2010    30
3   2   def10   2008    10
4   2   def10   2009    NaN
5   2   def10   2010    20
6   3   ghk     2008    NaN
7   3   ghk     2009    30
8   3   ghk     2009    NaN

if I use the following code:

df = df.set_index('id')
balanced = (id.set_index('year',append=True).reindex(pd.MultiIndex.from_product([df.index,range(df.year.min(),df.year.max()+1)],names=['frs_id','year'])).reset_index(level=1))

This gives me following error:

cannot handle a non-unique multi-index!

Solution

Pivot the table and stack year without drop NaN values:

>>> df.pivot(["id", "city"], "year", "value") \
      .stack(dropna=False) \
      .rename("value") \
      .reset_index()

   id   city  year  value
0   1    abc  2008   10.0
1   1    abc  2009   20.0
2   1    abc  2010   30.0
3   2  def10  2008   10.0
4   2  def10  2009    NaN
5   2  def10  2010   20.0
6   3    ghk  2008    NaN
7   3    ghk  2009   30.0
8   3    ghk  2010    NaN

Edit: case of duplicate entries

I slightly modified your original dataframe:

df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})

>>> df
   id   city  year  value
0   1    abc  2008     10
1   1    abc  2009     20
2   1    abc  2010     30
3   2  def10  2008     10
4   2  def10  2010     20
5   3    ghk  2009     30  # The problem is here
6   3    ghk  2009     40  # same (id, city, year)

You need to take a decision. Do you want to keep the row 5 or 6 or apply a math function (mean, sum, ...). Imagine you want the mean for (3, ghk, 2009):

>>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
      .stack(dropna=False) \
      .rename("value") \
      .reset_index()

   id   city  year  value
0   1    abc  2008   10.0
1   1    abc  2009   20.0
2   1    abc  2010   30.0
3   2  def10  2008   10.0
4   2  def10  2009    NaN
5   2  def10  2010   20.0
6   3    ghk  2008    NaN
7   3    ghk  2009   35.0  # <- mean of (30, 40)
8   3    ghk  2010    NaN