Search code examples
rdplyrtidyr

Sum columns based on a string match in other columns


I have a data fame with a structure similar to this:

df <- data.frame("id1"=c("A", "A", NA, "B", "A", "A"),
                 "val1"=c(1, 2, NA, 2, 3, 1),
                 "id2"=c("A", "B", "B", "B", "B", "A"),
                 "val2"=c(3, 2, 1, 3, 3, 2),
                 "id3" = c("C", "B", "C", "C", "D", "A"),
                 "val3" = c(2, 1, 2, 2, 1, 1))

> df
   id1 val1 id2 val2 id3 val3
1    A    1   A    3   C    2
2    A    2   B    2   B    1
3 <NA>   NA   B    1   C    2
4    B    2   B    3   C    2
5    A    3   B    3   D    1
6    A    1   A    2   A    1

What I would like to do is, per row, "merge" those id columns where there is a string match, and sum the corresponding val (or value) columns. I hope the request is clear, but in the end I would like an output that looks like:

df_new <- data.frame("id1"=c("A", "A", "B", "B", "A", "A"),
                     "val1"=c(4, 2, 1, 5, 3, 4),
                     "id2"=c("C", "B", "C", "C", "B", NA),
                     "val2"=c(2, 3, 2, 2, 3, NA),
                     "id3"=c(NA, NA, NA, NA, "D", NA),
                     "val3"=c(NA, NA,NA, NA, 1, NA))

> df_new
  id1 val1  id2 val2  id3 val3
1   A    4    C    2 <NA>   NA
2   A    2    B    3 <NA>   NA
3   B    1    C    2 <NA>   NA
4   B    5    C    2 <NA>   NA
5   A    3    B    3    D    1
6   A    4 <NA>   NA <NA>   NA

I tried doing this with a series of ifelse() statements but in the end got very confused. There must be a cleaner way to accomplish this. Thank you in advance for any help!


Solution

  • Here's a pipeline in tidyverse:

    library(dplyr)
    library(tidyr)
    
    df |>
      mutate(rown = row_number()) |> 
      pivot_longer(-rown, names_pattern = "(id|val)\\d+", 
                   names_to = ".value", values_drop_na = TRUE) |> 
      summarise(val = sum(val), .by = c(rown, id)) |> 
      mutate(number = row_number(), .by = rown) |> 
      pivot_wider(names_from = "number", values_from = c("id", "val"),
                  names_vary = "slowest", names_sep = "") |>
      select(-rown)
    
    #   id1 val1  id2 val2  id3 val3
    # 1   A    4    C    2 <NA>   NA
    # 2   A    2    B    3 <NA>   NA
    # 3   B    1    C    2 <NA>   NA
    # 4   B    5    C    2 <NA>   NA
    # 5   A    3    B    3    D    1
    # 6   A    4 <NA>   NA <NA>   NA
    

    Explanation: The idea is to convert to long to get a tidy format, which makes it easier to manipulate data in general, and especially in the tidyverse. This is why there is pivot_longer, and then pivot_wider (to get to the original wide format). In between, I create a sum by rown and id, and then create a number variable based on the row_number with rown groups, which will be useful to sort correctly in pivot_wider (with names_vary = "slowest").