Search code examples
rjoinmergedatasetcoalesce

How do I merge and coalesce several dataframes in R?


I have several datasets that are the results of long to wide transformations. Example data:

data1 <- data.frame(
                   x1 = c(1, NA, NA, NA, NA, NA),          # Create example data frame
                   x2 = c(NA, 1, NA, NA, NA, NA),
                   x3 = c(NA, NA, NA, 1, 1, NA),
                   y = 11:16,
                   z = 6:1)

data2 <- data.frame(
                   x1 = c(NA, NA, 2, 2, NA, NA),          # Create example data frame
                   x2 = c(NA, NA, NA, NA, 2, NA),
                   x4 = c(NA, NA, NA, NA, NA, 2),
                   y = 11:16,
                   z = 6:1)
               
data3 <- data.frame(
                    x2 = c(3, NA, 3, NA, NA, NA),          # Create example data frame
                    x3 = c(NA, 3, NA, NA, NA, NA),
                    x5 = c(NA, NA, NA, NA, 3, 3),
                    y = 11:16,
                    z = 6:1)

  • All datasets have the same number of rows.
  • Some columns are identical in all datasets, like y and z in the example.
  • Other columns are only present in some datasets. Some are found in multiple datasets (as x1, x2, and x3 here), other ones are only found once (as x4 and x5 here). These columns have many NA and a few valid numeric values.

I want to merge the datasets, obtaining the following:

 datafull
  x1 x2 x3 x4 x5  y z
1  1  3 NA NA NA 11 6
2 NA  1  3 NA NA 12 5
3  2  3 NA NA NA 13 4
4  2 NA  1 NA NA 14 3
5 NA  2  1 NA  3 15 2
6 NA NA NA  2  3 16 1

All columns should be present and any valid value present in any dataset should be included in the dedicated column to replace NA in other datasets (e.g., see how x1 and x2).

I tried with

datasetlist <- list(data1, data2, data3)
datafull <- Reduce(function(x, y) merge(x, y, all.y=TRUE), datasetlist)  

but what I get is

datafull
  x2  y z x3 x1 x4 x5
1  3 11 6 NA NA NA NA
2  3 13 4 NA NA NA NA
3 NA 12 5  3 NA NA NA
4 NA 14 3 NA  2 NA NA
5 NA 15 2 NA NA NA  3
6 NA 16 1 NA NA  2  3

It seems like in the iterative merge done by Reduce, values from the new datasets completely replace old values, also when the new value is a NA.

Any idea on how to avoid this?


Solution

  • Create a list L of all data frames and get a vector of all names in them, xn. Then create L2 which is like L except every component data frame has all columns in xn such that previously missing columns are filled with NA. Finally apply coalesce from dplyr.

    library(dplyr)
    
    L <- list(data1, data2, data3)
    xn <- sort(Reduce(union, lapply(L, names)))
    L2 <- lapply(L, \(x) replace(x, xn[! xn %in% names(x)], NA)[xn])
    co <- function(x) do.call("coalesce", lapply(L2, "[[", x))
    data.frame(Map(co, xn))
    

    giving

      x1 x2 x3 x4 x5  y z
    1  1  3 NA NA NA 11 6
    2 NA  1  3 NA NA 12 5
    3  2  3 NA NA NA 13 4
    4  2 NA  1 NA NA 14 3
    5 NA  2  1 NA  3 15 2
    6 NA NA NA  2  3 16 1