Search code examples
rcumsum

Combine / sum rows in R


I have a large dataset with each row representing a different category. However, some of categories have been miss spelled or miss named resulting in two rows for the same category. I would like to combine and sum specific rows.

DF

category 2021 2022 2023
Grain 890 900 978
Dried Fruits and Veg 45 55 58
Dried Fruits & Veg 66 74 88

Expected output

category 2021 2022 2023
Grain 890 900 978
Dried Fruits and Veg 111 62 146

I've been trying something like the strategy below but haven't been able to work it out.

DF %>% 
  mutate_at(c(2021:2023), cumsum(starts_with("Dried Fruits")))

Data:

df <- data.frame(
  stringsAsFactors = FALSE,
              category = c("Grain",
                       "Dried Fruits and Veg","Dried Fruits & Veg"),
              "2021" = c(890L, 45L, 66L),
              "2020" = c(900L, 55L, 74L),
              "2023" = c(978L, 58L, 88L)
)


Solution

  • This solution will possibly work just for this given scenario because I don't know what other types of misspelling and misnaming are there in the category column, but once you clean the category column to have all unique categories, you can simply do a group_by by category column and then summarise over the rest of the columns, like the following,

    library(dplyr)
    library(stringr)
    
    df <- data.frame(
      stringsAsFactors = FALSE,
                  category = c("Grain",
                           "Dried Fruits and Veg","Dried Fruits & Veg"),
                  "2021" = c(890L, 45L, 66L),
                  "2020" = c(900L, 55L, 74L),
                  "2023" = c(978L, 58L, 88L)
    )
    
    df %>% 
      mutate(
        category = str_replace(category, "&", "and")
      ) %>% 
      group_by(category) %>% 
      summarise(
        across(X2021:X2023, sum)
      )
    #> # A tibble: 2 × 4
    #>   category             X2021 X2020 X2023
    #>   <chr>                <int> <int> <int>
    #> 1 Dried Fruits and Veg   111   129   146
    #> 2 Grain                  890   900   978
    

    Created on 2022-07-11 by the reprex package (v2.0.1)

    Hope this helps.