Search code examples
rdataframedplyrpurrrcoalesce

Coalesce pairs of variables within a dataframe based on a regular expression


I want to use dplyr::coalesce to find the first non-missing value between pairs of variables in a dataframe containing multiple pairs of variable. The goal is to create a new dataframe with now only one copy for each pair of variable (a coalesce variable without NA values).

Here is an example:

df <- data.frame(
      A_1=c(NA, NA, 3, 4, 5),
      A_2=c(1, 2, NA, NA, NA),
      B_1=c(NA, NA, 13, 14, 15),
      B_2=c(11, 12, NA, NA, NA))


Expected output: 

A  B
1  11
2  12
3  13
4  14
5  15

I am guessing a mix of dplyr::coalesce with maybe dplyr::mutate_at based on regular expression could be use but I am not sure how to do it. Is there a way to complete this task with the tidyverse grammar?

Thanks!

EDIT: thanks everyone for your answers! However, I should have included the naming convention for my variables to facilitate the transfer of your answers to my actual problem.. I am sorry about that. My variables are geochemistry variables named in two parts (name of chemical element plus name of core).

Example: Al_TAC4.25.275 where Al is the element and TAC4.25.275 is the core. I want to coalesce the data from 3 different cores (second part of name) for each element (first part of name). I have 25 pairs of element to coalesce.


Solution

  • You could use transmute, e.g.

    library(dplyr)
    
    df <- data.frame(
      A_1 = c(NA, NA, 3, 4, 5),
      A_2 = c(1, 2, NA, NA, NA),
      B_1 = c(NA, NA, 13, 14, 15),
      B_2 = c(11, 12, NA, NA, NA)
      )
    
    df %>%
      transmute(A = coalesce(A_1, A_2),
                B = coalesce(B_1, B_2))
    #>   A  B
    #> 1 1 11
    #> 2 2 12
    #> 3 3 13
    #> 4 4 14
    #> 5 5 15
    

    Created on 2021-12-22 by the reprex package (v2.0.1)

    Another option, if you have lots of "A_*" and "B_*" columns (source: Romain François, user: @Romain Francois):

    library(dplyr)
    
    df <- data.frame(
      A_1 = c(NA, NA, 3, 4, 5),
      A_2 = c(1, 2, NA, NA, NA),
      B_1 = c(NA, NA, 13, 14, 15),
      B_2 = c(11, 12, NA, NA, NA)
      )
    
    coacross <- function(...) {
      coalesce(!!!across(...))
    }
    
    df %>%
      transmute(A = coacross(starts_with("A_")),
                B = coacross(starts_with("B_")))
    #>   A  B
    #> 1 1 11
    #> 2 2 12
    #> 3 3 13
    #> 4 4 14
    #> 5 5 15
    

    Created on 2021-12-22 by the reprex package (v2.0.1)

    Edit

    Based on your updated question, you don't have lots of "A_*" or "B_*" columns, but instead lots of "*_1", "*_2", and "*_3" columns. I think this is the most straightforward solution for your use-case:

    library(dplyr)
    
    df <- data.frame(Al_TAC4.25.275 = c(1, 1, 1, NA, NA, NA),
                     Al_TAC4.25.276 = c(NA, NA, 2, 2, 2, NA),
                     Al_TAC4.25.277 = c(NA, NA, 3, NA, NA, 3),
                     Au_TAC4.25.275 = c(1, 1, 1, NA, NA, NA),
                     Au_TAC4.25.276 = c(NA, NA, 2, 2, 2, NA),
                     Au_TAC4.25.277 = c(NA, NA, 3, NA, NA, NA),
                     Ar_TAC4.25.275 = c(1, 1, 1, NA, NA, 1),
                     Ar_TAC4.25.276 = c(NA, NA, 2, 2, 2, 2),
                     Ar_TAC4.25.277 = c(NA, NA, 3, NA, NA, 3))
    
    df
    #>   Al_TAC4.25.275 Al_TAC4.25.276 Al_TAC4.25.277 Au_TAC4.25.275 Au_TAC4.25.276
    #> 1              1             NA             NA              1             NA
    #> 2              1             NA             NA              1             NA
    #> 3              1              2              3              1              2
    #> 4             NA              2             NA             NA              2
    #> 5             NA              2             NA             NA              2
    #> 6             NA             NA              3             NA             NA
    #>   Au_TAC4.25.277 Ar_TAC4.25.275 Ar_TAC4.25.276 Ar_TAC4.25.277
    #> 1             NA              1             NA             NA
    #> 2             NA              1             NA             NA
    #> 3              3              1              2              3
    #> 4             NA             NA              2             NA
    #> 5             NA             NA              2             NA
    #> 6             NA              1              2              3
    
    names(df) %>% 
      split(str_extract(., '[:alpha:]+')) %>%
      map_dfc(~ coalesce(!!!df[.x][c(1,2,3)]))
    #> # A tibble: 6 × 3
    #>      Al    Ar    Au
    #>   <dbl> <dbl> <dbl>
    #> 1     1     1     1
    #> 2     1     1     1
    #> 3     1     1     1
    #> 4     2     2     2
    #> 5     2     2     2
    #> 6     3     1    NA
    
    # change the order of the list to change the 'priority'
    names(df) %>% 
      split(str_extract(., '[:alpha:]+')) %>%
      map_dfc(~ coalesce(!!!df[.x][c(3,2,1)]))
    #> # A tibble: 6 × 3
    #>      Al    Ar    Au
    #>   <dbl> <dbl> <dbl>
    #> 1     1     1     1
    #> 2     1     1     1
    #> 3     3     3     3
    #> 4     2     2     2
    #> 5     2     2     2
    #> 6     3     3    NA
    
    names(df) %>% 
      split(str_extract(., '[:alpha:]+')) %>%
      map_dfc(~ coalesce(!!!df[.x][c(2,1,3)]))
    #> # A tibble: 6 × 3
    #>      Al    Ar    Au
    #>   <dbl> <dbl> <dbl>
    #> 1     1     1     1
    #> 2     1     1     1
    #> 3     2     2     2
    #> 4     2     2     2
    #> 5     2     2     2
    #> 6     3     2    NA
    

    Created on 2021-12-22 by the reprex package (v2.0.1)