Search code examples
rjoincoalesce

R: join values, not variables?


I have 5 versions of the same dataset i.e., all version have the same columns and rows, with the same names. However, they contain values for different cells so that only if I could combine all of them, could I get all the data.

Here is an example:

dataset1 <- as.data.frame(matrix(c("1", "1", "NA", "NA", "NA", "NA", "A", "NA", "B", "B", "A", "B"), ncol = 2))
colnames(dataset1) = c("Patient", "Treatment")
dataset2 <- as.data.frame(matrix(c("1", "1", "2", "4", "3", "NA", "A", "NA", "B", "B", "A", "B"), ncol = 2))
colnames(dataset2) = c("Patient", "Treatment")
dataset3 <- as.data.frame(matrix(c("1", "1", "NA", "NA", "NA", "NA", "A", "NA", "B", "B", "A", "B"), ncol = 2))
colnames(dataset3) = c("Patient", "Treatment")
dataset4 <- as.data.frame(matrix(c("1", "1", "NA", "2", "NA", "NA", "A", "NA", "B", "B", "A", "B"), ncol = 2))
colnames(dataset4) = c("Patient", "Treatment")
dataset5 <- as.data.frame(matrix(c("1", "1", "NA", "2", "NA", "2", "A", "C", "B", "B", "A", "B"), ncol = 2))
colnames(dataset5) = c("Patient", "Treatment")

I would like to combine these 5 datasets in a way so that whichever cells in dataset 1 that is NA is replaced with values from dataset 2, if those cells are valid, or if not from dataset 3, and so forth, so that the result would look like this for the example data:

dataset_complete <- as.data.frame(matrix(c("1", "1", "2", "4", "3", "2", "A", "C", "B", "B", "A", "B"), ncol = 2))
colnames(dataset_complete) = c("Patient", "Treatment")

Is there an automatic way to do this? I tried to read of on join transformations (https://r4ds.hadley.nz/joins.html), but was not able to find a solution.

Kind regards


Solution

  • This can be done with a reduction. Since you mentioned r4ds, I'll use dplyr::coalesce to handle replacing the NA values. Up front, though, "NA" is a string literal, not R's NA. I'll assume you mean the latter.

    dataset1 <- as.data.frame(matrix(c("1", "1", NA, NA, NA, NA, "A", NA, "B", "B", "A", "B"), ncol = 2))
    colnames(dataset1) = c("Patient", "Treatment")
    dataset2 <- as.data.frame(matrix(c("1", "1", "2", "4", "3", NA, "A", NA, "B", "B", "A", "B"), ncol = 2))
    colnames(dataset2) = c("Patient", "Treatment")
    dataset3 <- as.data.frame(matrix(c("1", "1", NA, NA, NA, NA, "A", NA, "B", "B", "A", "B"), ncol = 2))
    colnames(dataset3) = c("Patient", "Treatment")
    dataset4 <- as.data.frame(matrix(c("1", "1", NA, "2", NA, NA, "A", NA, "B", "B", "A", "B"), ncol = 2))
    colnames(dataset4) = c("Patient", "Treatment")
    dataset5 <- as.data.frame(matrix(c("1", "1", NA, "2", NA, "2", "A", "C", "B", "B", "A", "B"), ncol = 2))
    colnames(dataset5) = c("Patient", "Treatment")
    

    Since you reference Hadley Wickham's R4DS, I'll suggest using dplyr::coalesce and purrr::map2_dfc for this, though they are not required, a base R version is easy too.

    Now for the solution:

    Reduce(function(prev, this) purrr::map2_dfc(prev, this, .f = dplyr::coalesce), 
           list(dataset1, dataset2, dataset3, dataset4, dataset5))
    # # A tibble: 6 × 2
    #   Patient Treatment
    #   <chr>   <chr>    
    # 1 1       A        
    # 2 1       C        
    # 3 2       B        
    # 4 4       B        
    # 5 3       A        
    # 6 2       B        
    

    The order of the frames is important. Once a "cell" is non-NA, anything after that will be ignored (which I think is the intent). If we reversed the order of frames, for instance, we get a slightly different result (in Patient):

    Reduce(function(prev, this) purrr::map2_dfc(prev, this, .f = coalesce), list(dataset5, dataset4, dataset3, dataset2, dataset1))
    # # A tibble: 6 × 2
    #   Patient Treatment
    #   <chr>   <chr>    
    # 1 1       A        
    # 2 1       C        
    # 3 2       B        
    # 4 2       B        
    # 5 3       A        
    # 6 2       B