R: apriori package/algorithm not working as expected in generating 'confidence' metric


I have a dataset that looks like this:

structure(list(CATEGORY = c("Flower", "Flower", "Concentrate,Flower", 
"Flower", "Flower", "Flower", "Flower", "Edible,Flower", "Concentrate,Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Edible,Flower", 
"Concentrate,Flower", "Flower", "Edible,Flower", "Flower", "Edible,Flower", 
"Edible,Flower", "Concentrate,Flower", "Flower", "Concentrate", 
"Flower", "Edible,Flower", "Flower", "Flower", "Flower", "Concentrate", 
"Edible,Flower", "Concentrate", "Flower", "Flower", "Concentrate,Flower", 
"Edible,Flower", "Flower", "Flower", "Edible,Flower", "Concentrate,Flower", 
"Concentrate", "Concentrate", "Concentrate", "Concentrate", "Edible,Flower", 
"Flower", "Edible,Flower", "Flower", "Concentrate", "Flower", 
"Concentrate,Flower", "Edible,Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Concentrate", "Concentrate", 
"Flower", "Flower", "Flower", "Edible,Flower", "Concentrate", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Concentrate", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Concentrate,Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Concentrate", 
"Flower", "Concentrate", "Flower", "Flower", "Flower", "Flower", 
"Edible,Flower", "Flower", "Concentrate,Flower", "Concentrate,Flower", 
"Flower", "Edible,Flower", "Flower", "Flower", "Flower", "Flower", 
"Concentrate,Flower", "Concentrate", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Concentrate,Flower", "Flower", 
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Flower", 
"Concentrate", "Concentrate,Flower", "Flower", "Flower", "Flower", 
"Edible,Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Concentrate", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Flower", "Flower", "Edible,Flower", "Concentrate", "Flower", 
"Flower", "Flower", "Flower", "Flower", "Flower", "Flower", "Flower", 
"Concentrate,Flower", "Flower", "Flower", "Flower", "Flower", 
"Edible,Flower")), row.names = c(NA, -200L), class = c("tbl_df", 
"tbl", "data.frame"))

glimpse(interesting_basket_items5)

Rows: 200
Columns: 1
$ CATEGORY <chr> "Flower", "Flower", "Concentrate,Flower", "Flower", "Flower", "Flower", "Flower", "Edible,Flower", "Concentrat…

I don't know why the following code using the arules package is not working as expected:

interesting_basket_items_list <- as(interesting_basket_items5, "transactions")

# mine association rules with the 'apriori' function
rules <- apriori(interesting_basket_items_list, parameter = list(support = 0.001, confidence = 0.05))

# sort the rules by lift
rules <- sort(rules, by = "lift")

# inspect the resulting rules
(rules <- inspect(rules))

The 'support' part of the output looks correct, as checked against my own logic here:

interesting_basket_items5 %>%
  group_by(CATEGORY) %>%
  tally() %>%
  mutate(pct = n / sum(n)) 

But the confidence part doesn't make sense to me.

I would expect the rule lhs = {Concentrate} -> rhs = {Concentrate, Flower} to have a confidence of 2/3, or 0.67, because I am dividing 14 (the number of transactions containing both Concentrate and Flower) by 21 (the number of transactions containing Concentrate).

In general, I can't understand why the lhs of these association rules is completely blank ({}) instead of showing a more interesting antecedent. The thresholds I am using should be inclusive enough: parameter = list(support = 0.001, confidence = 0.05). I also think the structure of the input data as a "transactions" dataset (a list within a list) is correct: {Concentrate}, {Concentrate, Flower}, etc.

Shouldn't this rules dataset have a lhs equal to Concentrate and a rhs equal to {Concentrate, Flower} with support = 0.070 and confidence = 0.67?

I would like to be able to understand how this is working in terms of probability and conditional probability, in a way where I can demonstrate how it makes sense, start-to-finish, instead of taking data blindly into the apriori package and trusting the output, so to speak.

I'd be happy with a solution that shows how to restructure the data passed to apriori so that the result makes sense, or one that tunes the arguments of apriori. I would also be happy with a tidyverse solution that implements support (a probability) and confidence (a conditional probability) in a way that is comprehensive and easy to demonstrate in code. I'm just not certain that tidyverse could easily cover all possible permutations of baskets for the conditional probability part, which is probably why the apriori package/algorithm was designed...

I'm able to get apriori to run but I have my doubts it's really correct for what I want, and I don't know how to coerce anything to make the result more accurate/intuitive, and I also don't know how to reconstruct what apriori is doing in tidyverse, so I'm stuck. I'd appreciate some kind of answer that could bring this together.


Solution

  • The problem is the data format. Each row holds a single string, which apriori treats as a transaction consisting of one item, when in fact it is a list of items joined by commas. Before feeding the data to apriori, you have to split each string into its items:

    library(arules)

    # Split each comma-separated string into a character vector of items,
    # giving a list with one vector per transaction
    basket_split <- strsplit(interesting_basket_items5$CATEGORY, ",")
    interesting_basket_items_list <- transactions(basket_split)
    rules <- apriori(
      interesting_basket_items_list,
      parameter = list(supp = 0.001, conf = 0.05)
    )
    inspect(rules)
    

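    The output then looks right. Note that the confidence of {Concentrate} => {Flower} is 14/35 = 0.40 rather than 14/21: the denominator counts all 35 transactions that contain Concentrate, with or without Flower (count 14 divided by coverage 0.175 × 200 = 35).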

        lhs              rhs           support confidence coverage lift      count
    [1] {}            => {Edible}      0.090   0.09000000 1.000    1.0000000  18  
    [2] {}            => {Concentrate} 0.175   0.17500000 1.000    1.0000000  35  
    [3] {}            => {Flower}      0.895   0.89500000 1.000    1.0000000 179  
    [4] {Edible}      => {Flower}      0.090   1.00000000 0.090    1.1173184  18  
    [5] {Flower}      => {Edible}      0.090   0.10055866 0.895    1.1173184  18  
    [6] {Concentrate} => {Flower}      0.070   0.40000000 0.175    0.4469274  14  
    [7] {Flower}      => {Concentrate} 0.070   0.07821229 0.895    0.4469274  14
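
To check these numbers without arules, the metrics can be recomputed by hand in base R. This is only a minimal sketch: the baskets are rebuilt from the counts implied by the output above (147 Flower-only, 21 Concentrate-only, 14 Concentrate plus Flower, 18 Edible plus Flower), and `contains`, `support`, `confidence` and `lift` are hypothetical helper names written for this illustration, not arules functions.

```r
# Rebuild the 200 baskets from the counts implied by the apriori output
# (assumption: 147 Flower-only = 179 - 14 - 18, 21 Concentrate-only = 35 - 14).
baskets <- rep(
  c("Flower", "Concentrate", "Concentrate,Flower", "Edible,Flower"),
  times = c(147, 21, 14, 18)
)
items <- strsplit(baskets, ",")

# TRUE for each basket that contains every item in `set`
contains <- function(set) {
  vapply(items, function(b) all(set %in% b), logical(1))
}

# support(set) = share of baskets containing all items in `set`
support <- function(set) mean(contains(set))

# confidence(lhs => rhs) = P(rhs | lhs) = support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)

# lift(lhs => rhs) = confidence(lhs => rhs) / support(rhs)
lift <- function(lhs, rhs) confidence(lhs, rhs) / support(rhs)

support(c("Concentrate", "Flower"))   # 14 / 200
confidence("Concentrate", "Flower")   # 14 / 35
lift("Concentrate", "Flower")         # (14 / 35) / (179 / 200)
```

Under these assumptions the three values should agree with row [6] of the table, which makes it explicit that confidence is a conditional probability over every basket containing the lhs.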