Combine pairs of columns across several columns with similar names within one data frame

I need to join data in columns that have similar, but slightly different names, into a singular column with a comma separating the data.

For example:

  df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1_1 = c('1','2','3','4','5'),
                  loci1_2 = c('5','4','3','2','1'),
                  loci2_1 = c('2','3','4','5','1'),
                  loci2_2 = c('3','4','5','2','1')
                  )

I have >100 columns named in such a manner to combine so that it might look like the following:

Sample_ID	loci1	loci2
animal1	1,5	2,3
animal2	2,4	3,4
animal3	3,3	4,5
animal4	4,2	5,2

If I create a list of all the final column names, can I iterate through the list in a for loop?

Solution

No need for a loop, you can achieve your desired outcome much faster using the tidyverse functions dplyr::mutate(), tidyr::pivot_longer(), and tidyr::pivot_wider(). I modified your example data to add another pair for illustrative purposes:

library(dplyr)
library(tidyr)

df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  loci1_1 = c('1','2','3','4','5'),
                  loci1_2 = c('5','4','3','2','1'),
                  loci2_1 = c('2','3','4','5','1'),
                  loci2_2 = c('3','4','5','2','1'),
                  loci3_1 = c('4','1','1','2','2'),
                  loci3_2 = c('6','5','4','3','2')
)

# Pivot df1 to long form, concatenate columns, then pivot back to wide form
result <- df1 |>
  pivot_longer(cols = -sample_ID, 
               names_to = c("loci", ".value"), 
               names_sep = "_") |>
  mutate(value = paste0(`1`, ",", `2`)) |>
  select(-c(`1`, `2`)) |>
  pivot_wider(names_from = loci, values_from = value)

result
# # A tibble: 5 × 4
#   sample_ID loci1 loci2 loci3
#   <chr>     <chr> <chr> <chr>
# 1 animal1   1,5   2,3   4,6  
# 2 animal2   2,4   3,4   1,5  
# 3 animal3   3,3   4,5   1,4  
# 4 animal4   4,2   5,2   2,3  
# 5 animal5   5,1   1,1   2,2

Update based on OP's comment

In the first example, names_sep was sufficient as there is only one underscore to 'match'. For more complex matches, such as where there are multiple underscores, you need to add a regex to capture patterns using names_pattern. In your case, two patterns. Like this:

df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
                  ABC_loci_1 = c('1','2','3','4','5'),
                  ABC_loci_2 = c('5','4','3','2','1'),
                  DEF_loci_1 = c('2','3','4','5','1'),
                  DEF_loci_2 = c('3','4','5','2','1'),
                  GHI_loci_1 = c('4','1','1','2','2'),
                  GHI_loci_2 = c('6','5','4','3','2')
)

result <- df1 |>
  pivot_longer(cols = -sample_ID,
               names_to = c("loci", ".value"),
               names_pattern = "([^_]+_[^_]+)_(.)") |>
  mutate(value = paste0(`1`, ",", `2`)) |>
  select(-c(`1`, `2`)) |>
  pivot_wider(names_from = loci,
              values_from = value)

# # A tibble: 5 × 4
#   sample_ID ABC_loci DEF_loci GHI_loci
#   <chr>     <chr>    <chr>    <chr>   
# 1 animal1   1,5      2,3      4,6     
# 2 animal2   2,4      3,4      1,5     
# 3 animal3   3,3      4,5      1,4     
# 4 animal4   4,2      5,2      2,3     
# 5 animal5   5,1      1,1      2,2

The regex ([^_]+_[^_]+)_(.) works like this:

[^_] matches any character except an underscore
the + will then match one or more characters
_ is a literal match
placing a combination of these inside () 'captures' everything up to the next literal e.g. _
(.) captures a single character after the preceding literal e.g. _