Search code examples
rtime-seriesnaimputation

Impute NA of a Categorical Variable based on the Timing of Appearance


I have a dataset like below, with unique firm-year observations. But the variable IndustryCode has some NA due to the merger earlier.

stkcd date industrycode
10    2002   .
10    2003   .
10    2004   E22
10    2005   E22
10    2006   E22
10    2007   E22
10    2008   G45
10    2009   G45
10    2010   .
10    2011   .
11    2001   .
11    2002   .
11    2003   D23
11    2004   D23
....

I want to impute NA with the value from the closest year of the same firm. For example for firm 10(stkcd=10), IndustryCode in the years before 2004 is changed to E22, the value from 2004, and NAs in the years after 2009 is replaced by G45, the value from 2009.

How can I achieve this in R?


Solution

  • How about apply two times the na.locf function from zoo?

    Basically it substitutes NA with the last value, you'll need to go backwards too.

    Some data:

    dat <- data.frame(
      stkcd = rep(10, 10),
      year = 2002:2011,
      type = c(NA,NA, "E22", "E22","E22", "E22", "G45", "G45", NA, NA)
    )
    
    
    library(zoo)
    dat$type <- na.locf(dat$type, na.rm = F) # computes NA with the last value found
    dat$type <- na.locf(dat$type, na.rm = F, fromLast = T) # this because you start with NAs, so you need to go backwards too
    
    # output:
    
    # stkcd year type
    # 1     10 2002  E22
    # 2     10 2003  E22
    # 3     10 2004  E22
    # 4     10 2005  E22
    # 5     10 2006  E22
    # 6     10 2007  E22
    # 7     10 2008  G45
    # 8     10 2009  G45
    # 9     10 2010  G45
    # 10    10 2011  G45
    

    If you have multiple company, you need to group_by first, from dplyr:

    library(dplyr)
    library(zoo)
    
    dat %>%
        group_by(stkcd) %>% # the variable used for the company name
        mutate(type = na.locf(type, na.rm = F),
               type = na.locf(type, na.rm = F, fromLast = T)) 
    

    For example with 2 companies:

    dat <- data.frame(
      stkcd = c(rep(10, 10), rep(20,10)),
      year = rep(2002:2011, 2),
      type = c(NA,NA, "E22", "E22","E22", "E22", "G45", "G45", NA, NA,
               NA,NA, "E22", "E22","E22", "E22", "G45", "G45", NA, NA)
    )
    
    dat %>%
      group_by(stkcd) %>% # the variable used for the company name
      mutate(type = na.locf(type, na.rm = F),
             type = na.locf(type, na.rm = F, fromLast = T)) 
    
    # A tibble: 20 x 3
    # Groups:   stkcd [2]
    # stkcd  year type 
    # <dbl> <int> <fct>
    #   1    10  2002 E22  
    # 2    10  2003 E22  
    # 3    10  2004 E22  
    # 4    10  2005 E22  
    # 5    10  2006 E22  
    # 6    10  2007 E22  
    # 7    10  2008 G45  
    # 8    10  2009 G45  
    # 9    10  2010 G45  
    # 10    10  2011 G45  
    # 11    20  2002 E22  
    # 12    20  2003 E22  
    # 13    20  2004 E22  
    # 14    20  2005 E22  
    # 15    20  2006 E22  
    # 16    20  2007 E22  
    # 17    20  2008 G45  
    # 18    20  2009 G45  
    # 19    20  2010 G45  
    # 20    20  2011 G45