I need to join data in columns that have similar, but slightly different names, into a singular column with a comma separating the data.
For example:
df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
loci1_1 = c('1','2','3','4','5'),
loci1_2 = c('5','4','3','2','1'),
loci2_1 = c('2','3','4','5','1'),
loci2_2 = c('3','4','5','2','1')
)
I have >100 columns named in such a manner to combine so that it might look like the following:
Sample_ID | loci1 | loci2 | loci3.. and so on |
---|---|---|---|
animal1 | 1,5 | 2,3 | |
animal2 | 2,4 | 3,4 | |
animal3 | 3,3 | 4,5 | |
animal4 | 4,2 | 5,2 |
If I create a list of all the final column names, can I iterate through the list in a for loop?
No need for a loop, you can achieve your desired outcome much faster using the tidyverse
functions dplyr::mutate()
, tidyr::pivot_longer()
, and tidyr::pivot_wider()
. I modified your example data to add another pair for illustrative purposes:
library(dplyr)
library(tidyr)
df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
loci1_1 = c('1','2','3','4','5'),
loci1_2 = c('5','4','3','2','1'),
loci2_1 = c('2','3','4','5','1'),
loci2_2 = c('3','4','5','2','1'),
loci3_1 = c('4','1','1','2','2'),
loci3_2 = c('6','5','4','3','2')
)
# Pivot df1 to long form, concatenate columns, then pivot back to wide form
result <- df1 |>
pivot_longer(cols = -sample_ID,
names_to = c("loci", ".value"),
names_sep = "_") |>
mutate(value = paste0(`1`, ",", `2`)) |>
select(-c(`1`, `2`)) |>
pivot_wider(names_from = loci, values_from = value)
result
# # A tibble: 5 × 4
# sample_ID loci1 loci2 loci3
# <chr> <chr> <chr> <chr>
# 1 animal1 1,5 2,3 4,6
# 2 animal2 2,4 3,4 1,5
# 3 animal3 3,3 4,5 1,4
# 4 animal4 4,2 5,2 2,3
# 5 animal5 5,1 1,1 2,2
Update based on OP's comment
In the first example, names_sep
was sufficient as there is only one underscore to 'match'. For more complex matches, such as where there are multiple underscores, you need to add a regex to capture patterns using names_pattern
. In your case, two patterns. Like this:
df1 <- data.frame(sample_ID = c('animal1', 'animal2', 'animal3', 'animal4', 'animal5'),
ABC_loci_1 = c('1','2','3','4','5'),
ABC_loci_2 = c('5','4','3','2','1'),
DEF_loci_1 = c('2','3','4','5','1'),
DEF_loci_2 = c('3','4','5','2','1'),
GHI_loci_1 = c('4','1','1','2','2'),
GHI_loci_2 = c('6','5','4','3','2')
)
result <- df1 |>
pivot_longer(cols = -sample_ID,
names_to = c("loci", ".value"),
names_pattern = "([^_]+_[^_]+)_(.)") |>
mutate(value = paste0(`1`, ",", `2`)) |>
select(-c(`1`, `2`)) |>
pivot_wider(names_from = loci,
values_from = value)
# # A tibble: 5 × 4
# sample_ID ABC_loci DEF_loci GHI_loci
# <chr> <chr> <chr> <chr>
# 1 animal1 1,5 2,3 4,6
# 2 animal2 2,4 3,4 1,5
# 3 animal3 3,3 4,5 1,4
# 4 animal4 4,2 5,2 2,3
# 5 animal5 5,1 1,1 2,2
The regex ([^_]+_[^_]+)_(.)
works like this: