python, pandas-groupby, panel-data

pandas: find and keep consecutive rows to create panel data


I have a DataFrame like the one below:

import pandas as pd

df = {'time': [1999,2001,2002,2003,2007,1999,2000,2001,2003,2004],
      'id':['A','A','A','A','A','B','B','B','B','B'],
      'value':[0.1,0.1,0.1,0.1,0.6,0.2,0.2,0.2,0.2,0.2]}
df = pd.DataFrame(df)

I want to create a panel dataset at the id-time level from it, which means I want something like:

   time id  value
0  2001  A    0.1
1  2002  A    0.1
2  2003  A    0.1
3  1999  B    0.2
4  2000  B    0.2
5  2001  B    0.2

Only consecutive rows within each id are kept. I can do this in just a few lines of R:

library(dplyr)

df <- df %>%
    mutate(time = as.integer(time)) %>%
    group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
    filter(n() >= consec_obs)
df <- df[, setdiff(colnames(df), c('grp'))]

where consec_obs is the minimum number of consecutive rows one wants to keep.

I searched for a while but couldn't find a solution, which surprised me a little since this is a fairly basic manipulation in econometric analysis. Does anyone know how to do this with Python?


Solution

  • Mimicking the R solution, I came up with a Python version on Sunday night; here it is:

    # flag rows where the gap to the previous 'time' within each id is not exactly 1
    df['diff'] = df.groupby('id')['time'].diff() != 1
    # cumulative sum of the flags labels each run of consecutive rows
    df['cusm'] = df.groupby('id')['diff'].cumsum()
    # group by 'id' and 'cusm', then keep only runs with at least 3 rows
    df.loc[df.groupby(['id','cusm']).transform('count')['diff'] >= 3].drop(['diff','cusm'], axis=1)
    

    If this seems a little hard to understand, try running the code line by line and you will get there.
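
    For a closer match to the R snippet, the same steps can also be wrapped in a small helper that takes consec_obs as a parameter. This is only a sketch; keep_consecutive is an illustrative name, not anything built into pandas:

    import pandas as pd

    def keep_consecutive(df, consec_obs=3):
        # flag the start of each new run of consecutive years within an id
        new_run = df.groupby('id')['time'].diff() != 1
        # cumulative sum of the flags labels each run within an id
        run_id = new_run.groupby(df['id']).cumsum()
        # length of the run each row belongs to
        run_len = df.groupby([df['id'], run_id])['time'].transform('size')
        return df[run_len >= consec_obs]

    print(keep_consecutive(df, consec_obs=3))

    On the sample data this keeps 2001-2003 for A and 1999-2001 for B, which matches the rows in the expected output above.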

    Would it be possible to merge the first two lines into one?
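
    One possible way (just a sketch) is to chain the comparison straight into a second groupby and cumsum, so the intermediate 'diff' column is never created; the final filter then has to count an existing column such as 'time' instead of 'diff':

    # combine the flagging and the cumulative sum in one expression
    df['cusm'] = (df.groupby('id')['time'].diff() != 1).groupby(df['id']).cumsum()
    # keep runs of at least 3 rows, counting 'time' since 'diff' no longer exists
    df.loc[df.groupby(['id','cusm'])['time'].transform('count') >= 3].drop('cusm', axis=1)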