I have a DataFrame like the one below:

import pandas as pd

df = {'time': [1999, 2001, 2002, 2003, 2007, 1999, 2000, 2001, 2003, 2004],
      'id': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
      'value': [0.1, 0.1, 0.1, 0.1, 0.6, 0.2, 0.2, 0.2, 0.2, 0.2]}
df = pd.DataFrame(df)
I want to create a panel dataset at the id-time level from it, which means I want something like this:
   time id  value
0  2001  A    0.1
1  2002  A    0.1
2  2003  A    0.1
3  1999  B    0.2
4  2000  B    0.2
5  2001  B    0.2
That is, only runs of consecutive rows within each id are kept. I can do this with just a few lines in R:
df <- df %>%
  mutate(time = as.integer(time)) %>%
  group_by(id, grp = cumsum(c(1, diff(time) != 1))) %>%
  filter(n() >= consec_obs)
df <- df[, setdiff(colnames(df), c('grp'))]
where consec_obs is the minimum number of consecutive rows one wants to keep.
I searched for a while but couldn't find a solution, which surprised me a bit, since this is a fairly basic manipulation in econometric analysis. Does anyone know how to do this in Python?
Mimicking the R solution, I came up with a Python version on Sunday night. Here it is:
# flag rows where the current and previous observations within each id are not consecutive
df['diff'] = df.groupby('id')['time'].diff() != 1
# cumulative sum of the flags labels each run of consecutive years
df['cusm'] = df.groupby('id')['diff'].cumsum()
# group by 'id' and 'cusm', then keep only the rows in runs that satisfy the prespecified length
df.loc[df.groupby(['id', 'cusm']).transform('count')['diff'] >= 3].drop(['diff', 'cusm'], axis=1)
If this seems a little hard to understand, run the code one line at a time and you will get there.
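As a side note, instead of hard-coding the 3, the consec_obs parameter from the R version can be mirrored by wrapping the same steps in a small helper. This is only a sketch of the same diff-and-cumsum idea; the name keep_consecutive is made up for illustration:

import pandas as pd

def keep_consecutive(frame, consec_obs):
    # keep only runs of at least consec_obs consecutive years within each id
    out = frame.copy()
    # True at the first row of each new run of consecutive years
    out['brk'] = out.groupby('id')['time'].diff() != 1
    # cumulative sum of the break flags labels each run within an id
    out['run'] = out.groupby('id')['brk'].cumsum()
    # length of each (id, run) group, broadcast back to the rows
    run_len = out.groupby(['id', 'run'])['time'].transform('size')
    return out.loc[run_len >= consec_obs].drop(['brk', 'run'], axis=1)

keep_consecutive(df, consec_obs=3)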
Would it be possible to merge the first two lines of the three-line snippet into one?
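One way that might work is to push both steps into a single transform call, so the flagging and the cumulative sum happen together within each id. This is just a sketch of the same idea, not a definitive answer:

# build the run label in one assignment: within each id, flag breaks and cumulative-sum them
df['cusm'] = df.groupby('id')['time'].transform(lambda s: (s.diff() != 1).cumsum())

The filtering line then stays the same as before.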