I am working with a large panel data of financial info, however the values are a bit spotty. I am trying to calculate the return between each year of each stock in my panel data. However, because of missing values sometimes firms have year gaps, making the: df['stock_ret'] = df.groupby(['tic'])['stock_price'].pct_change()
impossible to practice as it would be wrong. The df looks something like this (just giving an example):
datadate month fyear ticker price
0 31/12/1998 12 1998 AAPL 188.92
1 31/12/1999 12 1999 AAPL 197.44
2 31/12/2002 12 2002 AAPL 268.13
3 31/12/2003 12 2003 AAPL 278.06
4 31/12/2004 12 2004 AAPL 288.35
5 31/12/2005 12 2005 AAPL 312.23
6 31/05/2008 5 2008 TSLA 45.67
7 31/05/2009 5 2009 TSLA 38.29
8 31/05/2010 5 2010 TSLA 42.89
9 31/05/2011 5 2011 TSLA 56.03
10 31/05/2014 5 2014 TSLA 103.45
.. ... .. .. .. ..
What I am looking for is a piece of code that would allow me to understand (for each individual firm) if there is any gap in the data, and calculate returns for the two different series. Just like this:
datadate month fyear ticker price return
0 31/12/1998 12 1998 AAPL 188.92 NaN
1 31/12/1999 12 1999 AAPL 197.44 0.0451
2 31/12/2002 12 2002 AAPL 268.13 NaN
3 31/12/2003 12 2003 AAPL 278.06 0.0370
4 31/12/2004 12 2004 AAPL 288.35 0.0370
5 31/12/2005 12 2005 AAPL 312.23 0.0828
6 31/05/2008 5 2008 TSLA 45.67 NaN
7 31/05/2009 5 2009 TSLA 38.29 -0.1616
8 31/05/2010 5 2010 TSLA 42.89 0.1201
9 31/05/2011 5 2011 TSLA 56.03 0.3063
10 31/05/2014 5 2014 TSLA 103.45 NaN
.. ... .. .. .. ..
If you have any other suggestions on how to treat this problem, please feel free to share your knowledge :) I am a bit inexperienced so I am sure that your advice could help!
Thank you in advance guys!
You can create a mask that tells if the last year existed and just update those years with pct change:
df['return'] = np.nan
mask = df.groupby('ticker')['fyear'].apply(lambda x: x.shift(1)==x-1)
df.loc[mask,'return'] = df.groupby('ticker')['price'].pct_change()