Tags: python, pandas, time-series, lag

Pandas: creating a lagged column with grouped data


I'm working with the following DataFrame:

import pandas as pd

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.csv'
crsp = pd.read_csv(url)

crsp.head()

    feccandid   fec.dyn feccf   cid     date_crsp   catcode     amtsum
0   S8VT00018   NaN     NaN     N00000534   2005    J2100        2.1
1   S8VT00018   NaN     NaN     N00000534   2005    L1200        5.0
2   S8VT00018   NaN     NaN     N00000534   2005    J7300        0.0
4   S8NM00127   0.561   0.382   N00015616   2006    G2900        2.5
5   S8NJ00350   -0.329  NaN     N00000854   2005    LG000        7.5

I would like to lag the amtsum values by year. The following code shows how I derived amtsum and gives a sense of my desired outcome:

crsp['amtsum'] = (crsp[['date_crsp', 'cid', 'catcode', 'amount']]
                  .groupby(['date_crsp', 'catcode', 'cid'])
                  .amount.transform('sum'))
crsp['amtsum'] = crsp['amtsum'] / 1000
crsp.drop(['amount'], axis=1, inplace=True)
crsp.drop_duplicates(inplace=True, keep='first')

I have tried using the following code to get lagged values:

crsp['amtsumlag.1']=crsp.groupby(['date_crsp','catcode', 'cid'])['amtsum'].shift(1)

This returns

    feccandid   fec.dyn     feccf   cid     date_crsp   catcode     amtsum  amtsumlag.1
0   S8VT00018   NaN     NaN     N00000534   2005       J2100    2.1     NaN
1   S8VT00018   NaN     NaN     N00000534   2005       L1200    5.0     NaN
2   S8VT00018   NaN     NaN     N00000534   2005       J7300    0.0     NaN
4   S8NM00127   0.561   0.382   N00015616   2006       G2900    2.5     NaN
5   S8NJ00350   -0.329  NaN     N00000854   2005       LG000    7.5     NaN

Since my date_crsp range runs from 2005 to 2014, I expect NaN for 2005 but not for 2006. I have also selected only dates after 2005 and got the same result. Does anyone have a clue how to solve this?


Solution

  • You need to look into your .groupby() logic.

    crsp.groupby(['date_crsp','catcode', 'cid']).size().value_counts()
    

    results in

    1    444508
    2      5281
    3       619
    4         3
    

    Most of the groups have only a single value, so there's not much to shift. Unsurprisingly,

    crsp['amtsumlag.1'] = crsp.groupby(['catcode', 'cid', 'date_crsp'])['amtsum'].shift(1)
    

    results in:

    Data columns (total 8 columns):
    feccandid             456939 non-null object
    feccandcfscore.dyn    445710 non-null float64
    feccandcfscore        355887 non-null float64
    cid                   456939 non-null object
    date_crsp             456939 non-null int64
    catcode               456939 non-null object
    amtsum                456939 non-null float64
    amtsumlag.1           6528 non-null float64
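    The mechanics are easy to see on a toy frame (hypothetical data, not the CRSP file): when the year is part of the grouping key, almost every (date_crsp, catcode, cid) group holds a single row, so .shift(1) has no earlier row to pull from and returns NaN.

```python
import pandas as pd

# Hypothetical toy data: one row per (cid, catcode, date_crsp) combination
toy = pd.DataFrame({
    'cid':       ['A', 'A', 'B', 'B'],
    'catcode':   ['J2100'] * 4,
    'date_crsp': [2005, 2006, 2005, 2006],
    'amtsum':    [2.1, 2.5, 5.0, 7.5],
})

# Grouping by the year as well makes every group a single row,
# so shift(1) finds nothing to lag:
toy['lag_wrong'] = toy.groupby(['date_crsp', 'catcode', 'cid'])['amtsum'].shift(1)
print(toy['lag_wrong'].tolist())  # [nan, nan, nan, nan]
```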
    

    If you want to .shift() by date_crsp, you probably do not want to include it in .groupby(): each group would then contain only a single year, leaving nothing to shift. .groupby() should receive the variables that identify a unit over time, and the frame should be ordered by time, ideally via a DateTimeIndex or an explicit sort, so .sort_values('date_crsp') is a good idea.

    crsp['amtsumlag.1'] = crsp.sort_values('date_crsp').groupby(['catcode', 'cid'])['amtsum'].shift(1)
    

    instead yields:

    feccandid             456939 non-null object
    feccandcfscore.dyn    445710 non-null float64
    feccandcfscore        355887 non-null float64
    cid                   456939 non-null object
    date_crsp             456939 non-null int64
    catcode               456939 non-null object
    amtsum                456939 non-null float64
    amtsumlag.1           301280 non-null float64
    dtypes: float64(4), int64(1), object(3)
    

    Without knowing the data well enough to understand what identifies the groups you are trying to lag, it's hard to provide a more specific answer.
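
    The whole pattern on a small frame (hypothetical data, mimicking the structure above): sort by year first, group only by the entity keys, then shift, and each row receives the previous year's amtsum for its (catcode, cid) pair. The result of a grouped .shift() keeps the original index, so plain assignment aligns it back correctly.

```python
import pandas as pd

# Hypothetical toy data, deliberately unsorted by year
toy = pd.DataFrame({
    'cid':       ['A', 'A', 'A', 'B', 'B'],
    'catcode':   ['J2100'] * 5,
    'date_crsp': [2006, 2005, 2007, 2005, 2006],
    'amtsum':    [2.5, 2.1, 3.0, 5.0, 7.5],
})

# Order by year, group by the entity keys only, then lag by one row,
# i.e. one year per (catcode, cid):
toy['amtsumlag.1'] = (
    toy.sort_values('date_crsp')
       .groupby(['catcode', 'cid'])['amtsum']
       .shift(1)
)

print(toy.sort_values(['cid', 'date_crsp']))
```

Each entity's 2005 row gets NaN (no earlier year), while its 2006 row carries the 2005 amtsum, and so on.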