Search code examples
rjoinreplacecoalesce

How to replace columns with NA in a tibble with imputed columns from another tibble in R


I want to replace de columns with NA in df using the imputed values in df2 to get df3. I can do it with left_join and coalesce, but I think this method doesn't generalize well. Is there a better way?

library(tidyverse)

df <- tibble(c = c("a", "a", "a", "b", "b", "b"),
             d = c(1, 2, 3, 1, 2, 3),
             x = c(1, NA, 3, 4, 5,6),
             y = c(1, 2, NA, 4, 5, 6),
             z = c(1, 2, 7, 4, 5, 6))

# I want to replace NA in df by df2

df2 <- tibble(c = c("a", "a", "a"),
             d = c(1, 2, 3),
             x = c(1, 2, 3),
             y = c(1, 2, 2))

# to get

df3 <- tibble(c = c("a", "a", "a", "b", "b", "b"),
             d = c(1, 2, 3, 1, 2, 3),
             x = c(1, 2, 3, 4, 5, 6),
             y = c(1, 2, 2, 4, 5, 6),
             z = c(1, 2, 7, 4, 5, 6))

# is there a better solution than coalesce?

df3 <- df %>% left_join(df2, by = c("c", "d")) %>%
  mutate(x = coalesce(x.x, x.y),
         y = coalesce(y.x, y.y)) %>%
  select(-x.x, -x.y, -y.x, -y.y)
Created on 2021-06-17 by the reprex package (v2.0.0)

Solution

  • Here's a custom function that coalesces all .x and .y columns, optionally renaming and removing columns.

    #' Coalesce all columns duplicated in a previous join.
    #'
    #' Find all columns resulting from duplicate names after a join
    #' operation (e.g., `dplyr::*_join` or `base::merge`), then coalesce
    #' them pairwise.
    #'
    #' @param x data.frame
    #' @param suffix character, length 2, the same string suffixes
    #'   appended to column names of duplicate columns; should be the same
    #'   as provided to `dplyr::*_join(., suffix=)` or `base::merge(.,
    #'   suffixes=)`
    #' @param clean logical, whether to remove the suffixes from the LHS
    #'   columns and remove the columns on the RHS columns
    #' @param strict logical, whether to enforce same-classes in the LHS
    #'   (".x") and RHS (".y") columns; while it is safer to set this to
    #'   true (default), sometimes the conversion of classes might be
    #'   acceptable, for instance, if one '.x' column is 'numeric' and its
    #'   corresponding '.y' column is 'integer', then relaxing the class
    #'   requirement might be acceptable
    #' @return 'x', coalesced, optionally cleaned
    #' @export
    coalesce_all <- function(x, suffix = c(".x", ".y"),
                             clean = FALSE, strict = TRUE) {
      nms <- colnames(x)
      Xs <- endsWith(nms, suffix[1])
      Ys <- endsWith(nms, suffix[2])
      # x[Xs] <- Map(dplyr::coalesce, x[Xs], x[Ys])
      # x[Xs] <- Map(data.table::fcoalesce, x[Xs], x[Ys])
      x[Xs] <- Map(function(dotx, doty) {
        if (strict) stopifnot(identical(class(dotx), class(doty)))
        isna <- is.na(dotx)
        replace(dotx, isna, doty[isna])
      } , x[Xs], x[Ys])
      if (clean) {
        names(x)[Xs] <- gsub(glob2rx(paste0("*", suffix[1]), trim.head = TRUE), "", nms[Xs])
        x[Ys] <- NULL
      }
      x
    }
    

    In action:

    df %>%
      left_join(df2, by = c("c", "d")) %>%
      coalesce_all()
    # # A tibble: 6 x 7
    #   c         d   x.x   y.x     z   x.y   y.y
    #   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    # 1 a         1     1     1     1     1     1
    # 2 a         2     2     2     2     2     2
    # 3 a         3     3     2     7     3     2
    # 4 b         1     4     4     4    NA    NA
    # 5 b         2     5     5     5    NA    NA
    # 6 b         3     6     6     6    NA    NA
    
    df %>%
      left_join(df2, by = c("c", "d")) %>%
      coalesce_all(clean = TRUE)
    # # A tibble: 6 x 5
    #   c         d     x     y     z
    #   <chr> <dbl> <dbl> <dbl> <dbl>
    # 1 a         1     1     1     1
    # 2 a         2     2     2     2
    # 3 a         3     3     2     7
    # 4 b         1     4     4     4
    # 5 b         2     5     5     5
    # 6 b         3     6     6     6
    

    I included two coalesce functions as alternatives to the base-R within the Map. One advantage is the strict argument: dplyr::coalesce will silently allow integer and numeric to be coalesced, while data.table::fcoalesce does not. If that is desirable, use what you prefer. (Another advantage is that both of the non-base coalesce functions accept an arbitrary number of columns to coalesce, which is not required in this implementation.)