I have a data fame with a structure similar to this:
df <- data.frame("id1"=c("A", "A", NA, "B", "A", "A"),
"val1"=c(1, 2, NA, 2, 3, 1),
"id2"=c("A", "B", "B", "B", "B", "A"),
"val2"=c(3, 2, 1, 3, 3, 2),
"id3" = c("C", "B", "C", "C", "D", "A"),
"val3" = c(2, 1, 2, 2, 1, 1))
> df
id1 val1 id2 val2 id3 val3
1 A 1 A 3 C 2
2 A 2 B 2 B 1
3 <NA> NA B 1 C 2
4 B 2 B 3 C 2
5 A 3 B 3 D 1
6 A 1 A 2 A 1
What I would like to do is, per row, "merge" those id
columns where there is a string match, and sum the corresponding val
(or value) columns. I hope the request is clear, but in the end I would like an output that looks like:
df_new <- data.frame("id1"=c("A", "A", "B", "B", "A", "A"),
"val1"=c(4, 2, 1, 5, 3, 4),
"id2"=c("C", "B", "C", "C", "B", NA),
"val2"=c(2, 3, 2, 2, 3, NA),
"id3"=c(NA, NA, NA, NA, "D", NA),
"val3"=c(NA, NA,NA, NA, 1, NA))
> df_new
id1 val1 id2 val2 id3 val3
1 A 4 C 2 <NA> NA
2 A 2 B 3 <NA> NA
3 B 1 C 2 <NA> NA
4 B 5 C 2 <NA> NA
5 A 3 B 3 D 1
6 A 4 <NA> NA <NA> NA
I tried doing this with a series of ifelse()
statements but in the end got very confused. There must be a cleaner way to accomplish this. Thank you in advance for any help!
Here's a pipeline in tidyverse
:
library(dplyr)
library(tidyr)
df |>
mutate(rown = row_number()) |>
pivot_longer(-rown, names_pattern = "(id|val)\\d+",
names_to = ".value", values_drop_na = TRUE) |>
summarise(val = sum(val), .by = c(rown, id)) |>
mutate(number = row_number(), .by = rown) |>
pivot_wider(names_from = "number", values_from = c("id", "val"),
names_vary = "slowest", names_sep = "") |>
select(-rown)
# id1 val1 id2 val2 id3 val3
# 1 A 4 C 2 <NA> NA
# 2 A 2 B 3 <NA> NA
# 3 B 1 C 2 <NA> NA
# 4 B 5 C 2 <NA> NA
# 5 A 3 B 3 D 1
# 6 A 4 <NA> NA <NA> NA
Explanation:
The idea is to convert to long to get a tidy format, which makes it easier to manipulate data in general, and especially in the tidyverse
. This is why there is pivot_longer
, and then pivot_wider
(to get to the original wide format). In between, I create a sum
by rown
and id
, and then create a number
variable based on the row_number
with rown
groups, which will be useful to sort correctly in pivot_wider
(with names_vary = "slowest"
).