I have an inventor-year panel. Inventors can work for different firms during the sample, but we only observe where they work when they produce a patent. Otherwise, workplace is missing.
The assumption is that the midpoint between known workplaces is when they moved, e.g if Inventor A works for Firm A in Year 1 and then Firm B in Year 6, we assign Inventor A to Firm A in Years 2 and 3, and then Firm B for Years 4 and 5. This illustrates the point for one job change, but there could also be a Firm C in Year 9, so I would want to assign Inventor A to Firm B in Year 7 and Firm C in Year 8. I would like to do this on a large dataset of 1.25M inventors and 6.3M inventor-years, so I included Inventor B who is only in the sample for 3 years, but changes jobs once.
Not all inventors change jobs, so I could parse out those that do and work on them separately, while just using the fill command for those that never change jobs.
Data I have
have <- data.frame(
inventor_id=c('A','A','A','A','A','A','A','A','A','B','B','B'),
firm_id=c('A',NA,NA,NA,NA,'B',NA,NA,'C','D',NA,'E'),
fyear=c("2001", "2002","2003","2004","2005","2006","2007","2008","2009", "2005", "2006", "2007")
)
Data I want:
want <- data.frame(
inventor_id=c('A','A','A','A','A','A','A','A','A','B','B','B'),
firm_id=c('A','A','A','B','B','B','B','C','C','D','E', 'E'),
fyear=c("2001", "2002","2003","2004","2005","2006","2007","2008","2009", "2005", "2006", "2007")
)
Any help would be greatly appreciated,
in Base R you could do:
na_fill<- function(x){
idx <- is.na(x)
a <- rle(idx)
end <- cumsum(a$lengths)[a$values]
len <- a$lengths[a$values]
mid <- len %/% 2
x[idx] <- rep(x[t(cbind(end-len, end + 1))], t(cbind(mid, len - mid)))
x
}
transform(have, firm_id = na_fill(firm_id))
inventor_id firm_id fyear
1 A A 2001
2 A A 2002
3 A A 2003
4 A B 2004
5 A B 2005
6 A B 2006
7 A B 2007
8 A C 2008
9 A C 2009
10 B D 2005
11 B E 2006
12 B E 2007
Note that you do not have to use the groups if the group cannot start and cannot end with NA
but if it does, then use group_by
or ave