Search code examples
rmissing-datapanel-data

R: Fill missing values with lag forward to midway between two known values and the lead backwards to the midway


I have an inventor-year panel. Inventors can work for different firms during the sample, but we only observe where they work when they produce a patent. Otherwise, workplace is missing.

The assumption is that the midpoint between known workplaces is when they moved, e.g if Inventor A works for Firm A in Year 1 and then Firm B in Year 6, we assign Inventor A to Firm A in Years 2 and 3, and then Firm B for Years 4 and 5. This illustrates the point for one job change, but there could also be a Firm C in Year 9, so I would want to assign Inventor A to Firm B in Year 7 and Firm C in Year 8. I would like to do this on a large dataset of 1.25M inventors and 6.3M inventor-years, so I included Inventor B who is only in the sample for 3 years, but changes jobs once.

Not all inventors change jobs, so I could parse out those that do and work on them separately, while just using the fill command for those that never change jobs.

Data I have

have <- data.frame(
  inventor_id=c('A','A','A','A','A','A','A','A','A','B','B','B'),
  firm_id=c('A',NA,NA,NA,NA,'B',NA,NA,'C','D',NA,'E'),
  fyear=c("2001", "2002","2003","2004","2005","2006","2007","2008","2009", "2005", "2006", "2007")
  )

Data I want:

want <- data.frame(
  inventor_id=c('A','A','A','A','A','A','A','A','A','B','B','B'),
  firm_id=c('A','A','A','B','B','B','B','C','C','D','E', 'E'),
  fyear=c("2001", "2002","2003","2004","2005","2006","2007","2008","2009", "2005", "2006", "2007")
)

Any help would be greatly appreciated,


Solution

  • in Base R you could do:

    na_fill<- function(x){
      idx <- is.na(x)
      a <- rle(idx)
      end <- cumsum(a$lengths)[a$values] 
      len <- a$lengths[a$values] 
      mid <-  len %/% 2
      x[idx] <- rep(x[t(cbind(end-len, end + 1))], t(cbind(mid, len - mid)))
      x
    }
    
    transform(have, firm_id = na_fill(firm_id))
    
       inventor_id firm_id fyear
    1            A       A  2001
    2            A       A  2002
    3            A       A  2003
    4            A       B  2004
    5            A       B  2005
    6            A       B  2006
    7            A       B  2007
    8            A       C  2008
    9            A       C  2009
    10           B       D  2005
    11           B       E  2006
    12           B       E  2007
    

    Note that you do not have to use the groups if the group cannot start and cannot end with NA but if it does, then use group_by or ave