Search code examples
rdplyraggregatetidyrsummarize

How to summarize number of products purchased by a customer based on a product list in R


I have two dataframes in R, one consisting of a list of product SKUs (product ids), the other of a purchase log containing the order number, the customer email, the purchase date as well as the product id (product_sku) and quantity purchased.

purchases_dataframe:

order_number | email                | product_sku | quantity | purchase_date
1000         |customer1@sample.com  | RT-100      | 2        | 2020-01-01
1000         |customer1@sample.com  | CT-300      | 1        | 2020-01-01
1000         |customer1@sample.com  | Phone-100   | 1        | 2020-01-01
2000         |customer2@sample.com  | Phone-200   | 1        | 2020-04-20
2000         |customer2@sample.com  | OM-200      | 1        | 2020-04-20
3000         |customer3@sample.com  | CT-300      | 3        | 2020-03-15
4000         |customer1@sample.com  | OM-200      | 5        | 2020-07-07
5000         |customer4@sample.com  | Phone-200   | 3        | 2020-08-19
6000         |customer3@sample.com  | Phone-100   | 1        | 2020-09-22
6000         |customer3@sample.com  | RT-100      | 1        | 2020-09-22

tv_list:

 SKU
    RT-100
    CT-300
    OM-200
    LL-400
    ...

I would like to count the total number of TVs a customer has purchased in his/her lifetime and disregard all other products (e.g. phones). The dataframe tv_list should help me identify which SKUs are TVs and which are not, as I have a variety of different TV SKUs and the above is just a smaller example. The resulting dataframe would ideally look like this:

email                | number_purchased_tv
customer1@sample.com | 8
customer2@sample.com | 1
customer3@sample.com | 4
customer4@sample.com | 0

For reproducability and in order to easier follow my examples, here is the code for the two sample_tables above:

purchase_dataframe <- data.frame(order_number = c(1000,1000,1000, 2000,2000, 3000, 4000, 5000, 6000, 6000),
                      email = c("customer1@sample.com","customer1@sample.com", "customer1@sample.com","customer2@sample.com",
                                "customer2@sample.com","customer3@sample.com","customer1@sample.com","customer4@sample.com",
                                "customer3@sample.com","customer3@sample.com"),
                      product_sku = c("RT-100", "CT-300", "Phone-100", "Phone-200", "OM-200", "CT-300", "OM-200", "Phone-200", "Phone-100", "RT-100"),
                      quantity = c(2,1,1,1,1,3,5,3,1,1),
                      purchase_date = c("2020-01-01","2020-01-01","2020-01-01","2020-04-20","2020-04-20","2020-03-15","2020-07-07","2020-08-19","2020-09-22","2020-09-22"))

tv_list <- data.frame(SKU = c("RT-100", "OM-200", "CT-300", "LL-400", "ZV-700"))

Thanks a lot!


Solution

  • The below does what you requested using dplyr

    library(dplyr)
    library(data.table)
    purchase_dataframe %>% dplyr::group_by(email) %>% dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
    # A tibble: 4 x 2
    email                sumtv
    <chr>                <dbl>
      1 customer1@sample.com     8
    2 customer2@sample.com     1
    3 customer3@sample.com     4
    4 customer4@sample.com     0
    

    EDIT please find a correction above regarding the sumtv figure and a data.table solution below

    library(dplyr)
    library(data.table)
    purchase_datatable <- purchase_dataframe
    purchase_datatable %>% setDT
    > purchase_datatable[,sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by="email"][
      +   ,.(email, sumtv)] %>% unique
    email sumtv
    1: customer1@sample.com     8
    2: customer2@sample.com     1
    3: customer3@sample.com     4
    4: customer4@sample.com     0
    

    microbenchmarking gives almost a 50% advantage to the data.table solution which IMO is an excellent package well worth learning through these vignettes

    library(microbenchmark)
    microbenchmark(purchase_datatable[,sumtv := sum(quantity[product_sku %in% unique(tv_list$SKU)]), by="email"][
      ,.(email, sumtv)] %>% unique, purchase_dataframe %>% dplyr::group_by(email) %>% dplyr::summarise(sumtv = sum(quantity[product_sku %in% unique(tv_list$SKU)]))
    )
    min      lq     mean  median     uq    max neval
    1.268 1.42700 1.823445 1.80300 2.0887 2.8332   100
    2.715 2.98025 3.250287 3.20355 3.3509 8.8255   100