Search code examples
rdplyrtidyversedata-wrangling

Fill up missing values based on other entries on R


I have dataset input with a couple of missing values. and I have to create dataset output with the following logic:

  • If there is a missing in any of the columns b, c, or d, then check the correspondent a column and fill up the missing with the correspondent value from that row to the specific column.

I tried to do that with _join functions from dplyr but was unsuccessful.

I can do it manually, but this option is off the table because I have a big dataset with multiple instances like that.

Input

library(dplyr)

input <- tibble( a = rep(c("A", "B", "C", "D"),2 ),
        b = c(1:3, NA, rep(NA,4)),
        c = c(21:28),
        d = c(rep(NA,4), 54, NA, 34,11)) %>% 
  arrange(a)

Input view

# A tibble: 8 × 4
#  a         b     c     d
#  <chr> <int> <int> <dbl>
#1 A         1    21    NA
#2 A        NA    25    54
#3 B         2    22    NA
#4 B        NA    26    NA
#5 C         3    23    NA
#6 C        NA    27    34
#7 D        NA    24    NA
#8 D        NA    28    11

Output - expected view

# A tibble: 8 × 4
#   a         b     c     d
# <chr> <int> <int> <dbl>
# 1 A         1    21    54
# 2 A         1    25    54
# 3 B         2    22    NA
# 4 B         2    26    NA
# 5 C         3    23    34
# 6 C         3    27    34
# 7 D        NA    24    11
# 8 D        NA    28    11

Solution

  • Use function na.locf from package zoo to carry the last observation forward or in the opposite direction.

    suppressPackageStartupMessages(library(dplyr))
    
    input <- tibble( a = rep(c("A", "B", "C", "D"),2 ),
                     b = c(1:3, NA, rep(NA,4)),
                     c = c(21:28),
                     d = c(rep(NA,4), 54, NA, 34,11)) %>% 
      arrange(a)
    
    input %>%
      group_by(a) %>%
      mutate(across(b:d, zoo::na.locf, na.rm =  FALSE)) %>%
      mutate(across(b:d, zoo::na.locf, na.rm =  FALSE, fromLast = TRUE))
    #> # A tibble: 8 × 4
    #> # Groups:   a [4]
    #>   a         b     c     d
    #>   <chr> <int> <int> <dbl>
    #> 1 A         1    21    54
    #> 2 A         1    25    54
    #> 3 B         2    22    NA
    #> 4 B         2    26    NA
    #> 5 C         3    23    34
    #> 6 C         3    27    34
    #> 7 D        NA    24    11
    #> 8 D        NA    28    11
    

    Created on 2022-05-14 by the reprex package (v2.0.1)