Tags: r, dplyr, tidyverse, coalesce

Tidyverse: Replacing NAs with latest non-NA values *using tidyverse tools*


My question has been answered before using zoo:: and data.table::; I'm curious as to what the best solution with tidyverse/dplyr would be.

Previous answers (non-tidyverse): "Forward and backward fill data frame in R" and "Replacing NAs with latest non-NA value".

My data looks like this, where the earliest two years (2015, 2016) in each country (usa, aus) have missing data (code for data input at the bottom):

#>   country year value
#> 1     usa 2015    NA
#> 2     usa 2016    NA
#> 3     usa 2017   100
#> 4     usa 2018    NA
#> 5     aus 2015    NA
#> 6     aus 2016    NA
#> 7     aus 2017    50
#> 8     aus 2018    60

I would like to fill the missing values, within each country, with the value available in 2017.

I would like that fill to only be for the years prior to 2017--so an NA in 2018 should not be filled in by anything. It should remain NA.

So my desired output is:

#>   country year value
#> 1     usa 2015   100
#> 2     usa 2016   100
#> 3     usa 2017   100
#> 4     usa 2018    NA
#> 5     aus 2015    50
#> 6     aus 2016    50
#> 7     aus 2017    50
#> 8     aus 2018    60

I tried group_by(country) and then I suspect I'm meant to use coalesce(), but I normally use coalesce across vectors, not along them.

library(tidyverse)
df %>% group_by(country) %>% 

What's the easiest way to do this using tidyverse tools?

Code for Data Input:

#install.packages("datapasta")
df <- data.frame(
  stringsAsFactors = FALSE,
           country = c("usa", "usa", "usa", "usa", "aus", "aus", "aus", "aus"),
              year = c(2015L, 2016L, 2017L, 2018L, 2015L, 2016L, 2017L, 2018L),
             value = c(NA, NA, 100L, NA, NA, NA, 50L, 60L)
)
df


Solution

  • Within each country, we can replace the NAs in years before 2017 with the value available in 2017.

    library(dplyr)
    
    df %>% 
      group_by(country) %>% 
      mutate(value = replace(value, is.na(value) & year < 2017, value[year == 2017]))
      #Similarly with ifelse
      #mutate(value = ifelse(is.na(value) & year < 2017, value[year == 2017], value))
    
    #  country  year value
    #  <chr>   <int> <int>
    #1 usa      2015   100
    #2 usa      2016   100
    #3 usa      2017   100
    #4 usa      2018    NA
    #5 aus      2015    50
    #6 aus      2016    50
    #7 aus      2017    50
    #8 aus      2018    60
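
Since the question mentions coalesce(), here is a minimal sketch of two alternatives, offered as possibilities rather than the definitive approach. The first uses coalesce() along the column together with if_else(); it assumes every country has exactly one 2017 row. The second uses tidyr::fill() with .direction = "up"; it assumes the rows are ordered by year within each country, and it would also fill any NA that has a later non-NA value below it, so it only matches the "before 2017" rule for data shaped like the example. The names filled_coalesce and filled_up are illustrative only.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  stringsAsFactors = FALSE,
  country = c("usa", "usa", "usa", "usa", "aus", "aus", "aus", "aus"),
  year = c(2015L, 2016L, 2017L, 2018L, 2015L, 2016L, 2017L, 2018L),
  value = c(NA, NA, 100L, NA, NA, NA, 50L, 60L)
)

# Variant 1: coalesce() replaces NAs with the group's 2017 value,
# and if_else() restricts the replacement to years before 2017.
# Assumption: each country has exactly one row with year == 2017.
filled_coalesce <- df %>%
  group_by(country) %>%
  mutate(value = if_else(year < 2017,
                         coalesce(value, value[year == 2017]),
                         value)) %>%
  ungroup()

# Variant 2: fill() upwards within each country.
# Assumption: rows are sorted by year within each country.
# A trailing NA (such as usa 2018) has nothing below it and stays NA.
filled_up <- df %>%
  group_by(country) %>%
  fill(value, .direction = "up") %>%
  ungroup()
```

For the example data, both variants reproduce the desired output. Variant 1 encodes the "only before 2017" rule explicitly, so it is the safer choice if later years can contain gaps that must stay NA.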