I have a large dataset with each row representing a different category. However, some of categories have been miss spelled or miss named resulting in two rows for the same category. I would like to combine and sum specific rows.
DF
category | 2021 | 2022 | 2023 |
---|---|---|---|
Grain | 890 | 900 | 978 |
Dried Fruits and Veg | 45 | 55 | 58 |
Dried Fruits & Veg | 66 | 74 | 88 |
Expected output
category | 2021 | 2022 | 2023 |
---|---|---|---|
Grain | 890 | 900 | 978 |
Dried Fruits and Veg | 111 | 62 | 146 |
I've been trying something like the strategy below but haven't been able to work it out.
DF %>%
mutate_at(c(2021:2023), cumsum(starts_with("Dried Fruits")))
Data:
df <- data.frame(
stringsAsFactors = FALSE,
category = c("Grain",
"Dried Fruits and Veg","Dried Fruits & Veg"),
"2021" = c(890L, 45L, 66L),
"2020" = c(900L, 55L, 74L),
"2023" = c(978L, 58L, 88L)
)
This solution will possibly work just for this given scenario because I don't know what other types of misspelling and misnaming are there in the category column, but once you clean the category column to have all unique categories, you can simply do a group_by
by category column and then summarise
over the rest of the columns, like the following,
library(dplyr)
library(stringr)
df <- data.frame(
stringsAsFactors = FALSE,
category = c("Grain",
"Dried Fruits and Veg","Dried Fruits & Veg"),
"2021" = c(890L, 45L, 66L),
"2020" = c(900L, 55L, 74L),
"2023" = c(978L, 58L, 88L)
)
df %>%
mutate(
category = str_replace(category, "&", "and")
) %>%
group_by(category) %>%
summarise(
across(X2021:X2023, sum)
)
#> # A tibble: 2 × 4
#> category X2021 X2020 X2023
#> <chr> <int> <int> <int>
#> 1 Dried Fruits and Veg 111 129 146
#> 2 Grain 890 900 978
Created on 2022-07-11 by the reprex package (v2.0.1)
Hope this helps.