I have a panel dataset with a set of countries [Italy and US] for 3 years and two numeric variables ['Var1', 'Var2']
. I would like to calculate the rate of change in the last three years Ex: the value for Var1 in 2019 minus the value of Var1 in 2017 divided by Var1 in 2017.
I do not understand why my code (below) returns NaN errors?
data = {'Year':[2017, 2018, 2019, 2017, 2018, 2019], 'Country':['Italy', 'Italy', 'Italy', 'US' , 'US', 'US'], 'Var1':[23,75,45, 32,13,14], 'Var2':[21,75,47, 30,11,18]}
trend = pd.DataFrame(data)
list = ['Var1', 'Var2']
for col in list:
trend[col + ' (3 Year % Change)'] = ((trend.loc[trend['Year']==2019][col]- trend.loc[trend['Year']==2017][col])/trend.loc[trend['Year']==2017][col])*100
trend
There are a few things going wrong here with your code:
pd.series
and not only their arrays, and they are carrying their index, which causes the division to become NaN
.values
after the columns filters, you will bump into a ValueError
because you want two values to insert into the entire DataFrame
and pandas
won't like that (length should be the same). This exemplifies it:trend.loc['Var1' + ' (3 Year % Change)'] = ((trend.loc[trend['Year']==2019, 'Var1'].values - \
trend.loc[trend['Year']==2017, 'Var1'].values)/\
trend.loc[trend['Year']==2017, 'Var1'].values)*100
ValueError: cannot set a row with mismatched columns
list
as an actual variable name, but that is a reserved python
word. It is not the best idea. You can read about it hereIf you want to compare the values with 2017
values in your sample, you can use
groupby
+shift
, based on how many years to shift:
for col in ['Var1','Var2']:
trend[col + ' (3 Year % Change)'] = (trend[col] - trend.groupby('Country').shift(2)[col])/trend.groupby('Country').shift(2)[col]
Out[1]:
Year Country Var1 Var2 Var1 (3 Year % Change) Var2 (3 Year % Change)
0 2017 Italy 23 21 NaN NaN
1 2018 Italy 75 75 NaN NaN
2 2019 Italy 45 47 0.956522 1.238095
3 2017 US 32 30 NaN NaN
4 2018 US 13 11 NaN NaN
5 2019 US 14 18 -0.562500 -0.400000