Search code examples
rdataframesum

sum across multiple columns of a data frame based on multiple patterns R


I have a data frame of multiple variables for for different years, that looks kind of like this:

df <- data.frame(name=c("name1", "name2", "name3", "name4"),
                X1990=c(1,6,8,NA),
                X1990.1=c(10,20,NA,2),
                X1990.2=c(2,4,6,8),
                X1990.3=c(1,NA,3,6),
                X1990.4=c(8,7,5,4),
                X1991=c(2,6,3,5),
                X1991.1=c(NA,20,NA,2),
                X1991.2=c(NA,NA,NA,NA),
                X1991.3=c(1,NA,3,5),
                X1991.4=c(8,9,6,3))

I made this example with only 5 variables per year and with only 2 year, but in reality is a much larger df, with tens of variables for the years 1990 to 2020.

I want to create a new dataframe with the sums all the columns for the same year, so that the new data frame looks like this:

df_sum <- data.frame(name=c("name1", "name2", "name3", "name4"),
                     X1990=c(22, 37, 22, 20),
                     X1991=c(11,35,12,15))

I was thinking some loop over rowSums(across(matches('pattern')), na.rm = TRUE) that I found on another questions, but so far have not been successful to implement.

Thanks!


Solution

  • We can reshape to 'long' format with pivot_longer, and get the sum while reshaping back to 'wide'

    library(dplyr)
    library(tidyr)
    library(stringr)
    df %>%
       pivot_longer(cols = starts_with("X"), names_to = "name1") %>% 
       mutate(name1 = str_remove(name1, "\\.\\d+$")) %>% 
       pivot_wider(names_from = name1, values_from = value, 
          values_fn = ~ sum(.x, na.rm = TRUE))
    

    -output

    # A tibble: 4 × 3
      name  X1990 X1991
      <chr> <dbl> <dbl>
    1 name1    22    11
    2 name2    37    35
    3 name3    22    12
    4 name4    20    15
    

    Or in base R, use split.default to split the data into a list of datasets based on the column name pattern, get the rowSums and cbind with the first column

    cbind(df[1], sapply(split.default(df[-1], 
      trimws(names(df)[-1], whitespace = "\\.\\d+")), rowSums, na.rm = TRUE))
       name X1990 X1991
    1 name1    22    11
    2 name2    37    35
    3 name3    22    12
    4 name4    20    15