Search code examples
rdataframeloopscalculated-columnsalgebra

Calculate Top N products by sales and years


I have the data about sales by years and by-products, let's say like this:

Year <- c(2010,2010,2010,2010,2010,2011,2011,2011,2011,2011,2012,2012,2012,2012,2012)
Model <- c("a","b","c","d","e","a","b","c","d","e","a","b","c","d","e")
Sale <- c("30","45","23","33","24","11","56","19","45","56","33","32","89","33","12")
df <- data.frame(Year, Model, Sale)

product by years:

a= 30+11+33 = 74
b= 45+56+32 = 133
c= 23+19+89 = 131
d= 33+45+33 = 111
e= 12+56+24 = 92

Ranking by according to total sales within these 3 years:

1 2 3 4 5 
b c d e a

I want the code which identifies the TOP 2 products (according to total sales within these 3 years) by years and summarises all the rest products as category "other". So the output should be like this:

year     Model          Sale
2010      b              45
2010      c              23
2010      other          30+33+24=92
2011      b              56
2011      c              19
2011      other          11+45+56=112
2012      b              32
2012      c              89
2012      other          33+33+12= 78

Solution

  • A tidyverse solution. Your Sale data appear to be stored as character, which means we'll have to use as.numeric before summing them.

    library(tidyverse)
    
    df %>% 
      group_by(Model) %>% 
      mutate(
        Sale = as.numeric(Sale),
        total_sale = sum(Sale)
      ) %>% 
      ungroup %>% 
      mutate(
        model_condensed = ifelse(total_sale %in% rev(sort(unique(total_sale)))[1:2], Model, 'other')
      ) %>% 
      group_by(Year, model_condensed) %>% 
      summarize(Sale = sum(Sale))
    
       Year model_condensed  Sale
      <dbl> <chr>           <dbl>
    1  2010 b                  45
    2  2010 c                  23
    3  2010 other              87
    4  2011 b                  56
    5  2011 c                  19
    6  2011 other             112
    7  2012 b                  32
    8  2012 c                  89
    9  2012 other              78
    

    The above solution creates the "other" category by matching on the values in Sale. This could cause problems if those values have decimal places (see this question). Instead, we could use a two-step process to identify the top two Models by name, and use this to create the groupings for the total data:

    totals <- df %>% 
      group_by(Model) %>% 
      summarize(total_sale = sum(as.numeric(Sale))) %>% 
      arrange(desc(total_sale)) %>% 
      slice_head(n = 2)
    
    df %>% 
      group_by(Year, model_condensed = ifelse(Model %in% totals$Model, Model, 'other')) %>% 
      summarize(Sale = sum(as.numeric(Sale)))