R: Fill missing values with lag forward to midway between two known values and the lead backwards to the midway

I have an inventor-year panel. Inventors can work for different firms during the sample, but we only observe where they work when they produce a patent. Otherwise, workplace is missing.

The assumption is that the midpoint between known workplaces is when they moved, e.g if Inventor A works for Firm A in Year 1 and then Firm B in Year 6, we assign Inventor A to Firm A in Years 2 and 3, and then Firm B for Years 4 and 5. This illustrates the point for one job change, but there could also be a Firm C in Year 9, so I would want to assign Inventor A to Firm B in Year 7 and Firm C in Year 8. I would like to do this on a large dataset of 1.25M inventors and 6.3M inventor-years, so I included Inventor B who is only in the sample for 3 years, but changes jobs once.

Not all inventors change jobs, so I could parse out those that do and work on them separately, while just using the fill command for those that never change jobs.

Data I have

have <- data.frame(
  inventor_id=c('A','A','A','A','A','A','A','A','A','B','B','B'),
  firm_id=c('A',NA,NA,NA,NA,'B',NA,NA,'C','D',NA,'E'),
  fyear=c("2001", "2002","2003","2004","2005","2006","2007","2008","2009", "2005", "2006", "2007")
  )

Data I want:

want <- data.frame(
  inventor_id=c('A','A','A','A','A','A','A','A','A','B','B','B'),
  firm_id=c('A','A','A','B','B','B','B','C','C','D','E', 'E'),
  fyear=c("2001", "2002","2003","2004","2005","2006","2007","2008","2009", "2005", "2006", "2007")
)

Any help would be greatly appreciated,

Solution

in Base R you could do:

na_fill<- function(x){
  idx <- is.na(x)
  a <- rle(idx)
  end <- cumsum(a$lengths)[a$values] 
  len <- a$lengths[a$values] 
  mid <-  len %/% 2
  x[idx] <- rep(x[t(cbind(end-len, end + 1))], t(cbind(mid, len - mid)))
  x
}

transform(have, firm_id = na_fill(firm_id))

   inventor_id firm_id fyear
1            A       A  2001
2            A       A  2002
3            A       A  2003
4            A       B  2004
5            A       B  2005
6            A       B  2006
7            A       B  2007
8            A       C  2008
9            A       C  2009
10           B       D  2005
11           B       E  2006
12           B       E  2007

Note that you do not have to use the groups if the group cannot start and cannot end with NA but if it does, then use group_by or ave