I am currently working with a panel data of financial information on pandas, therefore working with different companies across different years. I am trying to generate a column of the $ invested shifted by 2 time periods. Hence, reporting the value of time t also at t+2.
Normally, to lag a variable, I would use df.groupby('tic')['investments'].shift(2)
, however unfortunately my data is a bit spotty and therefore for the same company I might have a gap in the years. Just to give you an idea, here is an example of my df:
datadate fyear tic invest
0 31/12/1998 1997 AAPL 12.3
1 31/12/1999 1998 AAPL 14.5
2 31/12/2002 2002 AAPL 9.7
3 31/12/2003 2003 AAPL 21.8
4 31/12/2004 2004 AAPL 21.4
5 31/12/2005 2005 AAPL 18.9
6 31/05/2008 2008 TSLA 11.5
7 31/05/2009 2009 TSLA 13.7
8 31/05/2010 2010 TSLA 19.4
9 31/05/2011 2011 TSLA 14.5
10 31/05/2014 2013 TSLA 14.8
.. ... .. .. ..
Therefore, what I am trying to achieve is the following:
datadate fyear tic invest l2_invest
0 31/12/1998 1997 AAPL 12.3 NaN
1 31/12/1999 1998 AAPL 14.5 NaN
2 31/12/2002 2002 AAPL 9.7 NaN
3 31/12/2003 2003 AAPL 21.8 NaN
4 31/12/2004 2004 AAPL 21.4 9.7
5 31/12/2005 2005 AAPL 18.9 21.8
6 31/05/2008 2008 TSLA 11.5 NaN
7 31/05/2009 2009 TSLA 13.7 NaN
8 31/05/2010 2010 TSLA 19.4 11.5
9 31/05/2011 2011 TSLA 14.5 13.7
10 31/05/2014 2013 TSLA 14.8 19.4
.. ... .. .. ..
Thank you in advance for your help :)
Provided that the 'datadate' column is the table's index (and of type datetime64), the following code should produce the desired additional column:
df.groupby('tic')['invest'].shift(1, freq=pd.DateOffset(years=2))
Edit: it is still necessary to append this new column to the original table. Here are some more details.
First, make sure the 'datadate' column is of type datetime64:
df['datadate'] = pd.to_datetime(df['datadate'])
Then let's create the new column and append it to our table:
new_column = (df.set_index('datadate')
.groupby('tic')['invest']
.shift(1, freq=pd.DateOffset(years=2)))
df.set_index(['tic', 'datadate'], inplace=True)
df['l2_invest'] = new_column
df.reset_index(inplace=True)
Starting with the original table, this produces
tic datadate fyear invest l2_invest
0 AAPL 1998-12-31 1997 12.3 NaN
1 AAPL 1999-12-31 1998 14.5 NaN
2 AAPL 2002-12-31 2002 9.7 NaN
3 AAPL 2003-12-31 2003 21.8 NaN
4 AAPL 2004-12-31 2004 21.4 9.7
5 AAPL 2005-12-31 2005 18.9 21.8
6 TSLA 2008-05-31 2008 11.5 NaN
7 TSLA 2009-05-31 2009 13.7 NaN
8 TSLA 2010-05-31 2010 19.4 11.5
9 TSLA 2011-05-31 2011 14.5 13.7
10 TSLA 2014-05-31 2013 14.8 NaN