Search code examples
rdata.tablemarket-basket-analysis

How to find number of times items and their combinations were purchased?


I have a data.table showing which items a customer purchased. Each row represents one customer and each column one item. The table has same number of columns for each customer and the values in columns item* are either 1 or 0 based on whether customer purchased a given item. Simple version of the table looks like this:

data.table(customerID = c(1,2,3,4,5),
           item1 = c(1,0,0,1,1),
           item2 = c(1,0,1,1,1),
           item3 = c(1,0,0,0,1),
           item4 = c(0,1,1,1,1))

The table says that customer 1 purchased items 1,2,3 and item 3 was purchased by customers 1 and 5.

In real case the data.table has so many columns it would be impractical to refer to them by name in the code, but it would be ok to have the data in long format instead.

I need to find out how many times individual items were purhcased and how many times their combinations were purchased. In this case i would like to get something like:

item1 3
item2 4
item3 2
item4 4
item1;item2 3
item1;item3 2
item1;item4 1
...
(same for other combinations of length 2)
...
item1;item2;item3 2
item1;item2;item4 1

...
up to combinations of 4 items.

Moreover, i'll need for each customer a table indicating which combinations of products he or she purchased.

Edit:

Thanks to three very useful answers, i know how to answer the first part of the question - i.e. calculate how many customers purchased certain combination. However, the second part remains unanswered. I would like to know which customers purchased which combination.


Solution

  • This is a completely base R option so converting the data to dataframe

    df <- data.frame(df)
    unique_product <- names(df[-1])
    
    stack(unlist(sapply(seq_along(unique_product), function(x) 
         combn(unique_product, x, FUN = function(y) 
               setNames(sum(rowSums(df[y] == 1) == length(y)), 
                paste0(y, collapse = ";")), simplify = FALSE))))
    
    
    #   values                     ind
    #1       3                   item1
    #2       4                   item2
    #3       2                   item3
    #4       4                   item4
    #5       3             item1;item2
    #6       2             item1;item3
    #7       2             item1;item4
    #8       2             item2;item3
    #9       3             item2;item4
    #10      1             item3;item4
    #11      2       item1;item2;item3
    #12      2       item1;item2;item4
    #13      1       item1;item3;item4
    #14      1       item2;item3;item4
    #15      1 item1;item2;item3;item4
    

    We create all combination of each unique product with combn and for every combination count how many of them occur together by subsetting the respective columns from dataframe.


    To get the customers which unlocked certain combinations we can continue the same approach

    stack(unlist(sapply(seq_along(unique_product), function(x) 
         combn(unique_product, x, FUN = function(y) {
          inds <- rowSums(df[x] == 1) == length(x)
          setNames(df$customerID[inds], 
                 rep(paste0(y, collapse = ";"), sum(inds)))
                 }, simplify = FALSE))))
    
    #   values                     ind
    #1       1                   item1
    #2       1                   item2
    #3       1                   item3
    #4       1                   item4
    #5       1             item1;item2
    #6       4             item1;item2
    #7       5             item1;item2
    #8       1             item1;item3
    #9       4             item1;item3
    #10      5             item1;item3
    #....
    

    You can rename the columns if needed but here values are the Customer Id's and ind are the combinations which the respective customer unlocked.