I'm working with the following DataFrame
:
url='https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.csv'
df=pd.read_csv(url)
df.head()
feccandid fec.dyn feccf cid date_crsp catcode amtsum
0 S8VT00018 NaN NaN N00000534 2005 J2100 2.1
1 S8VT00018 NaN NaN N00000534 2005 L1200 5.0
2 S8VT00018 NaN NaN N00000534 2005 J7300 0.0
4 S8NM00127 0.561 0.382 N00015616 2006 G2900 2.5
5 S8NJ00350 -0.329 NaN N00000854 2005 LG000 7.5
I would like to lag the amtsum
values by year. The following code shows how I derived amtsum
and provides insight into the nature of my desired outcome:
crsp['amtsum']=crsp[['date_crsp', 'cid', 'catcode', 'amount']].\
groupby(['date_crsp', 'catcode','cid']).amount.transform('sum')
crsp['amtsum'] = crsp['amtsum'] / 1000
crsp.drop(['amount'], axis=1, inplace=True)
crsp.drop_duplicates(inplace=True, keep='first')
I have tried using the following code to get lagged values:
crsp['amtsumlag.1']=crsp.groupby(['date_crsp','catcode', 'cid'])['amtsum'].shift(1)
This returns
feccandid fec.dyn feccf cid date_crsp catcode amtsum amtsumlag.1
0 S8VT00018 NaN NaN N00000534 2005 J2100 2.1 NaN
1 S8VT00018 NaN NaN N00000534 2005 L1200 5.0 NaN
2 S8VT00018 NaN NaN N00000534 2005 J7300 0.0 NaN
4 S8NM00127 0.561 0.382 N00015616 2006 G2900 2.5 NaN
5 S8NJ00350 -0.329 NaN N00000854 2005 LG000 7.5 NaN
Since my date_crsp
range is from 2005 to 2014 I expect to have NaN
for 2005 but not for 2006. I have also selected only dates after 2005 and got the same results. Does anyone have a clue on how to solve this problem?
You need to look into your .groupby()
logic.
crsp.groupby(['date_crsp','catcode', 'cid']).size().value_counts()
results in
1 444508
2 5281
3 619
4 3
Most of the groups have only a single value, so there's not much to shift. Unsurprisingly,
crsp['amtsumlag.1'] = crsp.groupby(['catcode', 'cid', 'date_crsp'])['amtsum'].shift(1)
results in:
Data columns (total 8 columns):
feccandid 456939 non-null object
feccandcfscore.dyn 445710 non-null float64
feccandcfscore 355887 non-null float64
cid 456939 non-null object
date_crsp 456939 non-null int64
catcode 456939 non-null object
amtsum 456939 non-null float64
amtsumlag.1 6528 non-null float64
If you want to .shift()
by date_crsp
, for instance, you probably do not want to use it in .groupby()
- each group would then only contain a single year. One would expect to see unique group id
variables in .groupby()
, and ideally a DateTimeIndex
, or otherwise a sorted DataFrame
. So .sort_values('date_crsp')
might be a good idea.
crsp['amtsumlag.1'] = crsp.sort_values('date_crsp').groupby(['catcode', 'cid'])['amtsum'].shift(1)
instead yields:
feccandid 456939 non-null object
feccandcfscore.dyn 445710 non-null float64
feccandcfscore 355887 non-null float64
cid 456939 non-null object
date_crsp 456939 non-null int64
catcode 456939 non-null object
amtsum 456939 non-null float64
amtsumlag.1 301280 non-null float64
dtypes: float64(4), int64(1), object(3)
Without know the data enough to understand what identifies the groups you are trying to lag it's hard to provide a more specific answer.