I have several datasets that are the results of long to wide transformations. Example data:
data1 <- data.frame(
x1 = c(1, NA, NA, NA, NA, NA), # Create example data frame
x2 = c(NA, 1, NA, NA, NA, NA),
x3 = c(NA, NA, NA, 1, 1, NA),
y = 11:16,
z = 6:1)
data2 <- data.frame(
x1 = c(NA, NA, 2, 2, NA, NA), # Create example data frame
x2 = c(NA, NA, NA, NA, 2, NA),
x4 = c(NA, NA, NA, NA, NA, 2),
y = 11:16,
z = 6:1)
data3 <- data.frame(
x2 = c(3, NA, 3, NA, NA, NA), # Create example data frame
x3 = c(NA, 3, NA, NA, NA, NA),
x5 = c(NA, NA, NA, NA, 3, 3),
y = 11:16,
z = 6:1)
I want to merge the datasets, obtaining the following:
datafull
x1 x2 x3 x4 x5 y z
1 1 3 NA NA NA 11 6
2 NA 1 3 NA NA 12 5
3 2 3 NA NA NA 13 4
4 2 NA 1 NA NA 14 3
5 NA 2 1 NA 3 15 2
6 NA NA NA 2 3 16 1
All columns should be present and any valid value present in any dataset should be included in the dedicated column to replace NA in other datasets (e.g., see how x1 and x2).
I tried with
datasetlist <- list(data1, data2, data3)
datafull <- Reduce(function(x, y) merge(x, y, all.y=TRUE), datasetlist)
but what I get is
datafull
x2 y z x3 x1 x4 x5
1 3 11 6 NA NA NA NA
2 3 13 4 NA NA NA NA
3 NA 12 5 3 NA NA NA
4 NA 14 3 NA 2 NA NA
5 NA 15 2 NA NA NA 3
6 NA 16 1 NA NA 2 3
It seems like in the iterative merge done by Reduce, values from the new datasets completely replace old values, also when the new value is a NA.
Any idea on how to avoid this?
Create a list L
of all data frames and get a vector of all names in them, xn
. Then create L2
which is like L
except every component data frame has all columns in xn
such that previously missing columns are filled with NA. Finally apply coalesce
from dplyr.
library(dplyr)
L <- list(data1, data2, data3)
xn <- sort(Reduce(union, lapply(L, names)))
L2 <- lapply(L, \(x) replace(x, xn[! xn %in% names(x)], NA)[xn])
co <- function(x) do.call("coalesce", lapply(L2, "[[", x))
data.frame(Map(co, xn))
giving
x1 x2 x3 x4 x5 y z
1 1 3 NA NA NA 11 6
2 NA 1 3 NA NA 12 5
3 2 3 NA NA NA 13 4
4 2 NA 1 NA NA 14 3
5 NA 2 1 NA 3 15 2
6 NA NA NA 2 3 16 1