Search code examples
rdataframeaggregatesummary

Sum variables by unique variable names, for different metrics - requires finding unique names before/after a prefix in variable name


Is there is a way to sum variables (e.g. sales and units) for all unique variable names (brands like coke and pepsi) within a dataframe. To help, here is some example data.

set.seed(123)
period <- seq(as.Date('2021/01/01'), as.Date('2021/01/07'), by="day")
Coke_Regular_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Diet_Units <- sample(1000:2000, 7, replace = TRUE)
Coke_Regular_Sales <- sample(500:1000,7, replace = TRUE)
Coke_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Regular_Units  <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Diet_Units  <- sample(1000:2000, 7, replace = TRUE)
Pepsi_Regular_Sales <- sample(500:1000, 7, replace = TRUE)
Pepsi_Diet_Sales <- sample(500:1000, 7, replace = TRUE)
df <- data.frame(Coke_Regular_Units, Coke_Diet_Units, Coke_Regular_Sales, Coke_Diet_Sales, 
                 Pepsi_Regular_Units, Pepsi_Diet_Units, Pepsi_Regular_Sales, Pepsi_Diet_Sales)

> head(df)
      period Coke_Regular_Units Coke_Diet_Units Coke_Regular_Sales Coke_Diet_Sales Pepsi_Regular_Units
1 2021-01-01               1414            1117                589             847                1425
2 2021-01-02               1462            1298                590             636                1648
3 2021-01-03               1178            1228                755             976                1765
4 2021-01-04               1525            1243                696             854                1210
5 2021-01-05               1194            1013                998             827                1931
6 2021-01-06               1937            1373                590             525                1589
  Pepsi_Diet_Units Pepsi_Regular_Sales Pepsi_Diet_Sales
1             1554                 608              943
2             1870                 762              808
3             1372                 892              634
4             1843                 924              808
5             1142                 829              910
6             1543                 522              723

I like a code to automatically calculate Coke_Sales, Coke_Units, Pepsi_Sales, Pepsi_Units, Regular_Sales and Diet_Units.

I am currently doing it like this for each variable

library(dplyr) 
df$Coke_Sales <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Sales")))))
df$Coke_Units <- rowSums(Filter(is.numeric, select(df, (matches("Coke") & matches("Units")))))

This is ok for a small number of variables, but I need to do this for 100s of variables. Is there any function that enables this? It would need to automatically find the unique variable names like Coke, Pepsi, Diet and Regular. The metric is the last part of the variable name, so doesn't necessarily need to auto-find this but would be great. If it makes it any easier, it would be ok to specify the metrics as there are only 3 metrics at most, but there are hundreds of brands.

If it cant be automated, is there a way it can be simplified, where I specify the variables required. Not perfect but still an improvement. For example including these lines of code to specify variables to sum and metrics required.

VarsToSum <- c("Coke", "Pepsi", "Diet", "Regular")
Metrics <- c("Sales", "Units")

If it can't be accomplished that way either, maybe I need to break into smaller steps, any tips would be great. Trying to think how to do it, should I try to find unique name before a prefix "_", then calculate "Sales" and "Units" for those unique names. Would this be the best way to do it? Or should I reshape the data? Are there any other routes to get there?

Any help, or directions how to achieve this would be greatly appreciated. Thanks


Solution

  • here is a data.tableapproach...

    library( data.table )
    setDT(df) #make it a data.table
    #melt to long
    ans <- melt( df, id.vars = "period", variable.factor = FALSE )
    #split variable to 3 new columns
    ans[, c("brand", "type", "what") := tstrsplit( variable, "_" ) ]
    # > head(ans)
    #        period           variable value brand    type  what
    # 1: 2021-01-01 Coke_Regular_Units  1414  Coke Regular Units
    # 2: 2021-01-02 Coke_Regular_Units  1462  Coke Regular Units
    # 3: 2021-01-03 Coke_Regular_Units  1178  Coke Regular Units
    # 4: 2021-01-04 Coke_Regular_Units  1525  Coke Regular Units
    # 5: 2021-01-05 Coke_Regular_Units  1194  Coke Regular Units
    # 6: 2021-01-06 Coke_Regular_Units  1937  Coke Regular Units
    
    #summarise however you like
    ans[, .(total = sum(value) ), by = .(brand, type, what)]
    #    brand    type  what total
    # 1:  Coke Regular Units 10527
    # 2:  Coke    Diet Units  8936
    # 3:  Coke Regular Sales  5158
    # 4:  Coke    Diet Sales  5171
    # 5: Pepsi Regular Units 11160
    # 6: Pepsi    Diet Units 10813
    # 7: Pepsi Regular Sales  5447
    # 8: Pepsi    Diet Sales  5491