Search code examples
pythonpandasmulti-indexpanel-data

Pandas first difference panel data with multi index


I have two data frames with the same variables but from different years:

df2016 = pd.DataFrame({"ID": [100,101,102,103], "A": [1,2,3,4], "B": [2,4,6,8], "year": [2016,2016,2016,2016]})

    ID  A   B   year
0   100 1   2   2016
1   101 2   4   2016
2   102 3   6   2016
3   103 4   8   2016

df2017 = pd.DataFrame({"ID": [100,101,102,103], "A": [5,6,7,8], "B": [9,11,13,15], "year": [2017,2017,2017,2017]})

    ID  A   B   year
0   100 5   9   2017
1   101 6   11  2017
2   102 7   13  2017
3   103 8   15  2017

I want to combine these and then first-difference the variable "A" wrt time to create a new variable "delta_A" that gives the change in "A" between 2017 and 2016, for every "ID". The result should be something like

Desired Output

           A    B   delta_A
ID  year            
100 2016    1   2   NaN
101 2016    2   4   NaN
102 2016    3   6   NaN
103 2016    4   8   NaN
100 2017    5   9   4
101 2017    6   11  4
102 2017    7   13  4
103 2017    8   15  4

I want to work within a multi_index setting if possible since I gather that that will be useful for other things. Other answers show how to use series.diff() method to first-difference but do not show how to work with multi-index.

Here's what I tried:

df = pd.concat([df2016, df2017])
df

    ID  A   B   year
0   100 1   2   2016
1   101 2   4   2016
2   102 3   6   2016
3   103 4   8   2016
0   100 5   9   2017
1   101 6   11  2017
2   102 7   13  2017
3   103 8   15  2017

df.set_index(["ID", "year"], inplace=True)
df

            A   B
ID  year        
100 2016    1   2
101 2016    2   4
102 2016    3   6
103 2016    4   8
100 2017    5   9
101 2017    6   11
102 2017    7   13
103 2017    8   15

Actual Output (not desired)

df["delta_A"] = df["A"].diff()
df
            A   B   delta_A
ID  year            
100 2016    1   2   NaN
101 2016    2   4   1.0
102 2016    3   6   1.0
103 2016    4   8   1.0
100 2017    5   9   1.0
101 2017    6   11  1.0
102 2017    7   13  1.0
103 2017    8   15  1.0

This is first-differencing wrt the ID dimension rather than the year dimension. I want it first-difference wrt to the year dimension as in my desired output above.

Another approach would be to create a lagged version of "A" and then substract it from "A", but I'm not sure how to do that within multi-index.

Thanks!


Solution

  • In your given example, you can simply use groupby + diff. Note: This approach will only work if dataframe is sorted by year.

    df['delta_A'] = df.groupby(level=0)['A'].diff()
    

    If you like an alternative way to get the diff by accessing the multiindex values manually, here is another more robust approach:

    s1 = df.loc[(slice(None), 2017), 'A']
    s2 = df.loc[(slice(None), 2016), 'A']
    
    df['delta_A'] = s1.sub(s2.droplevel(1))
    

              A   B  delta_A
    ID  year                
    100 2016  1   2      NaN
    101 2016  2   4      NaN
    102 2016  3   6      NaN
    103 2016  4   8      NaN
    100 2017  5   9      4.0
    101 2017  6  11      4.0
    102 2017  7  13      4.0
    103 2017  8  15      4.0