Is there is a way to sum variables (e.g. sales and units) for all unique variable names (brands like coke and pepsi) within a dataframe. To help, here is some example data.
set.seed(123)
period <- seq(as.Date('2021/01/01'), as.Date('2021/01/07'), by="day")
Coke_Regular_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Diet_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Regular_Sales <- sample(500:1000,7, replace = TRUE)
Coke_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Regular_Units <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Diet_Units <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Regular_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
df <- data.frame(Coke_Regular_Units, Coke_Diet_Units, Coke_Regular_Sales, Coke_Diet_Sales,
Pepsi_Regular_Units, Pepsi_Diet_Units, Pepsi_Regular_Sales, Pepsi_Diet_Sales)
> head(df)
period Coke_Regular_Units Coke_Diet_Units Coke_Regular_Sales Coke_Diet_Sales Pepsi_Regular_Units
1 2021-01-01 1414 1117 589 847 1425
2 2021-01-02 1462 1298 590 636 1648
3 2021-01-03 1178 1228 755 976 1765
4 2021-01-04 1525 1243 696 854 1210
5 2021-01-05 1194 1013 998 827 1931
6 2021-01-06 1937 1373 590 525 1589
Pepsi_Diet_Units Pepsi_Regular_Sales Pepsi_Diet_Sales
1 1554 608 943
2 1870 762 808
3 1372 892 634
4 1843 924 808
5 1142 829 910
6 1543 522 723
I like a code to automatically calculate Coke_Sales, Coke_Units, Pepsi_Sales, Pepsi_Units, Regular_Sales and Diet_Units.
I am currently doing it like this for each variable
library(dplyr)
df$Coke_Sales <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Sales")))))
df$Coke_Units <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Units")))))
This is ok for a small number of variables, but I need to do this for 100s of variables. Is there any function that enables this? It would need to automatically find the unique variable names like Coke, Pepsi, Diet and Regular. The metric is the last part of the variable name, so doesn't necessarily need to auto-find this but would be great. If it makes it any easier, it would be ok to specify the metrics as there are only 3 metrics at most, but there are hundreds of brands.
If it cant be automated, is there a way it can be simplified, where I specify the variables required. Not perfect but still an improvement. For example including these lines of code to specify variables to sum and metrics required.
VarsToSum <- c("Coke", "Pepsi", "Diet", "Regular")
Metrics <- c("Sales", "Units")
If it can't be accomplished that way either, maybe I need to break into smaller steps, any tips would be great. Trying to think how to do it, should I try to find unique name before a prefix "_", then calculate "Sales" and "Units" for those unique names. Would this be the best way to do it? Or should I reshape the data? Are there any other routes to get there?
Any help, or directions how to achieve this would be greatly appreciated. Thanks
here is a data.table
approach...
library( data.table )
setDT(df) #make it a data.table
#melt to long
ans <- melt( df, id.vars = "period", variable.factor = FALSE )
#split variable to 3 new columns
ans[, c("brand", "type", "what") := tstrsplit( variable, "_" ) ]
# > head(ans)
# period variable value brand type what
# 1: 2021-01-01 Coke_Regular_Units 1414 Coke Regular Units
# 2: 2021-01-02 Coke_Regular_Units 1462 Coke Regular Units
# 3: 2021-01-03 Coke_Regular_Units 1178 Coke Regular Units
# 4: 2021-01-04 Coke_Regular_Units 1525 Coke Regular Units
# 5: 2021-01-05 Coke_Regular_Units 1194 Coke Regular Units
# 6: 2021-01-06 Coke_Regular_Units 1937 Coke Regular Units
#summarise however you like
ans[, .(total = sum(value) ), by = .(brand, type, what)]
# brand type what total
# 1: Coke Regular Units 10527
# 2: Coke Diet Units 8936
# 3: Coke Regular Sales 5158
# 4: Coke Diet Sales 5171
# 5: Pepsi Regular Units 11160
# 6: Pepsi Diet Units 10813
# 7: Pepsi Regular Sales 5447
# 8: Pepsi Diet Sales 5491