Search code examples
rfor-loopjoinmerge

Combine pairs of columns across several columns with similar names within one data frame


I need to join data in columns that have similar, but slightly different names, into a singular column with a comma separating the data.

For example:

  df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1_1 = c('1','2','3','4','5'),
                  loci1_2 = c('5','4','3','2','1'),
                  loci2_1 = c('2','3','4','5','1'),
                  loci2_2 = c('3','4','5','2','1')
                  )

I have >100 columns named in such a manner to combine so that it might look like the following:

Sample_ID loci1 loci2 loci3.. and so on
animal1 1,5 2,3
animal2 2,4 3,4
animal3 3,3 4,5
animal4 4,2 5,2

If I create a list of all the final column names, can I iterate through the list in a for loop?


Solution

  • No need for a loop, you can achieve your desired outcome much faster using the tidyverse functions dplyr::mutate(), tidyr::pivot_longer(), and tidyr::pivot_wider(). I modified your example data to add another pair for illustrative purposes:

    library(dplyr)
    library(tidyr)
    
    df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                      loci1_1 = c('1','2','3','4','5'),
                      loci1_2 = c('5','4','3','2','1'),
                      loci2_1 = c('2','3','4','5','1'),
                      loci2_2 = c('3','4','5','2','1'),
                      loci3_1 = c('4','1','1','2','2'),
                      loci3_2 = c('6','5','4','3','2')
    )
    
    # Pivot df1 to long form, concatenate columns, then pivot back to wide form
    result <- df1 |>
      pivot_longer(cols = -sample_ID, 
                   names_to = c("loci", ".value"), 
                   names_sep = "_") |>
      mutate(value = paste0(`1`, ",", `2`)) |>
      select(-c(`1`, `2`)) |>
      pivot_wider(names_from = loci, values_from = value)
    
    result
    # # A tibble: 5 × 4
    #   sample_ID loci1 loci2 loci3
    #   <chr>     <chr> <chr> <chr>
    # 1 animal1   1,5   2,3   4,6  
    # 2 animal2   2,4   3,4   1,5  
    # 3 animal3   3,3   4,5   1,4  
    # 4 animal4   4,2   5,2   2,3  
    # 5 animal5   5,1   1,1   2,2
    

    Update based on OP's comment

    In the first example, names_sep was sufficient as there is only one underscore to 'match'. For more complex matches, such as where there are multiple underscores, you need to add a regex to capture patterns using names_pattern. In your case, two patterns. Like this:

    df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                      ABC_loci_1 = c('1','2','3','4','5'),
                      ABC_loci_2 = c('5','4','3','2','1'),
                      DEF_loci_1 = c('2','3','4','5','1'),
                      DEF_loci_2 = c('3','4','5','2','1'),
                      GHI_loci_1 = c('4','1','1','2','2'),
                      GHI_loci_2 = c('6','5','4','3','2')
    )
    
    result <- df1 |>
      pivot_longer(cols = -sample_ID,
                   names_to = c("loci", ".value"),
                   names_pattern = "([^_]+_[^_]+)_(.)") |>
      mutate(value = paste0(`1`, ",", `2`)) |>
      select(-c(`1`, `2`)) |>
      pivot_wider(names_from = loci,
                  values_from = value)
    
    # # A tibble: 5 × 4
    #   sample_ID ABC_loci DEF_loci GHI_loci
    #   <chr>     <chr>    <chr>    <chr>   
    # 1 animal1   1,5      2,3      4,6     
    # 2 animal2   2,4      3,4      1,5     
    # 3 animal3   3,3      4,5      1,4     
    # 4 animal4   4,2      5,2      2,3     
    # 5 animal5   5,1      1,1      2,2  
    

    The regex ([^_]+_[^_]+)_(.) works like this:

    • [^_] matches any character except an underscore
    • the + will then match one or more characters
    • _ is a literal match
    • placing a combination of these inside () 'captures' everything up to the next literal e.g. _
    • (.) captures a single character after the preceding literal e.g. _