Data: I have two mini example datasets in R called alpha and bravo. Each has 199 rows, 4 columns, and all values are binary. The distributions of 0 and 1 values differ between alpha and bravo, but they're roughly comparable. Reproducible data code at the end of this post.
Goal: Produce the associated rules with LHS, RHS, Support, Confidence, and Lift Ratio values for each dataset.
Problem: I have two questions / asks.
The apriori function can be fussy on the input transaction data, particularly if the data is coerced into transaction format, but I'm confused by some of the results I'm seeing. Hoping someone here can help explain them.
Work thus far:
- Ran apriori with both datasets input into the function as binary, logical, and transaction formats.
- Adjusted apriori parameters to lower the Support and Confidence values in case alpha's 'no rules' results were a threshold issue.
Understanding apriori in R: From what I've read in the apriori docs, I should be fine inputting my data in binary, logical, or transaction formats; however, the first two will be coerced to transactions when processed within the apriori function. I also noted the warning that such coercion may cause issues if the data is not "well behaved", in relation to the itemCoding and discretizeDF functions, but I haven't yet pinpointed how that ties to everything I'm seeing.
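To make the "binary" and "logical" input forms concrete, here is a minimal base-R sketch. The three-row data frame `toy` is a hypothetical stand-in for alpha (it is not the real dataset), and nothing here calls arules; it only shows what each form looks like before any coercion happens.

```r
# Hypothetical three-row stand-in for alpha (not the real data)
toy <- data.frame(a1 = c(1L, 1L, 0L), a2 = c(0L, 0L, 0L),
                  a3 = c(0L, 0L, 0L), a4 = c(1L, 1L, 1L))

# "Binary" form: a data.frame whose columns are integers
col_classes <- sapply(toy, class)
print(col_classes)

# "Logical" form: comparing a data.frame to a number returns a logical matrix
toy_lgl <- toy > 0.5
print(is.logical(toy_lgl))

# Column totals survive the conversion, since TRUE counts as 1
print(colSums(toy_lgl))
```

The point of the check is that the two forms carry the same information but have different types, which is what the coercion step sees.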
Data sneak peek (Repro code below)
alpha[1:3,] | bravo[1:3,] |
---|---|
a1 a2 a3 a4 | b1 b2 b3 b4 |
1 0 0 1 | 0 0 1 1 |
1 0 0 1 | 1 1 1 1 |
0 0 0 1 | 0 0 1 1 |
Example association rule code & outputs
# create rules with three data input options
# alpha
a.bin.rules <- apriori(alpha) # binary
a.log.rules <- apriori(alpha>0.5) # logical
a.tra.rules <- apriori(as(alpha, "transactions")) # transactions input
# bravo
b.bin.rules <- apriori(bravo) # binary
b.log.rules <- apriori(bravo>0.5) # logical
b.tra.rules <- apriori(as(bravo, "transactions")) # transactions
Apriori quality outputs were the same for all 6:
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.8 0.1 1 none FALSE TRUE 5 0.1 1 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Apriori remaining output for alpha binary data input:
Absolute minimum support count: 19
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4 item(s), 199 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [32 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
Apriori remaining output was:
Only the bravo dataset with the logical data format input worked as expected. Results truncated here to save space. Showing mid-output.
inspect(b.log.rules)
lhs rhs support confidence coverage lift count
[6] {b1, b2} => {b3} 0.1306533 0.8666667 0.1507538 1.077917 26
[7] {b1, b2} => {b4} 0.1457286 0.9666667 0.1507538 1.241075 29
[8] {b2, b3} => {b4} 0.2713568 0.9310345 0.2914573 1.195328 54
All other inspected rules had =[0,1] following each variable name, and rule metrics all equaled 1. Why is this?
Results truncated here to save space. Showing mid-output.
Example below is from alpha binary and was the same for alpha transactions. Bravo only differed by changing the variable letters from a to b.
inspect(a.bin.rules)
lhs rhs support confidence coverage lift count
[17] {a1=[0,1], a2=[0,1]} => {a3=[0,1]} 1 1 1 1 199
[18] {a1=[0,1], a3=[0,1]} => {a2=[0,1]} 1 1 1 1 199
[19] {a2=[0,1], a3=[0,1]} => {a1=[0,1]} 1 1 1 1 199
I decreased Support and Confidence as low as 0.01 each and the results were the same for all versions. The only exception was bravo logical, which still worked and just had up to 32 working rules. Code update examples:
b.bin.rules <- apriori(bravo, parameter = list(supp = 0.01, conf = 0.01)) # binary
b.log.rules <- apriori(bravo>0.5, parameter = list(supp = 0.01, conf = 0.01)) # logical
b.tra.rules <- apriori(as(bravo, "transactions"), parameter = list(supp = 0.01, conf = 0.01)) # transactions
Dataset comparison
# Sparsity
sum(as.matrix(alpha) == 0) / length(as.matrix(alpha))
[1] 0.5226131
sum(as.matrix(bravo) == 0) / length(as.matrix(bravo))
[1] 0.4120603
# Column totals
colSums(alpha)
a1 a2 a3 a4
91 115 52 122
colSums(bravo)
b1 b2 b3 b4
88 65 160 155
# Row total sums
table( rowSums(alpha) )
0 1 2 3 4
10 56 82 44 7
table( rowSums(bravo) )
0 1 2 3 4
9 29 69 67 25
# Cross-Data Column Correlations
round(sapply(1:ncol(alpha), function(i) cor(alpha[, i], bravo[, i])), 5)
[1] -0.06583 -0.16407 0.00550 -0.00062
# Similarity comparison by element
comparison <- alpha == bravo
sim_count <- sum(comparison)
(sim_count / (nrow(alpha) * ncol(alpha))) * 100
[1] 44.72362
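The column totals above give single-item counts. Pairwise co-occurrence counts, which are the numerators of two-item supports, can be sketched in base R with crossprod. The data frame `toy` below is a hypothetical four-row example, not the real alpha, so the numbers are illustrative only.

```r
# Hypothetical four-row 0/1 data (not the real alpha)
toy <- data.frame(a1 = c(1L, 1L, 0L, 1L),
                  a2 = c(0L, 1L, 1L, 1L),
                  a4 = c(1L, 1L, 1L, 0L))
m <- as.matrix(toy)

# co[i, j] counts rows where columns i and j are both 1;
# the diagonal reproduces the column totals
co <- crossprod(m)
print(co)

# Dividing by the number of rows gives pairwise support
pair_support <- co / nrow(m)
print(pair_support)
```

Running the same two lines on as.matrix(alpha) or as.matrix(bravo) would give counts to compare against the count column of any two-item rule.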
Reproducible datasets
alpha <- structure(list(a1 = c(1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L,
0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L,
0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L,
0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L,
0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L,
1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L,
1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L), a2 = c(0L,
0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L,
0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L,
1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L,
0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L,
0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 0L,
1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L,
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L,
0L, 1L, 0L, 1L, 0L, 0L), a3 = c(0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L),
a4 = c(1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L,
1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L,
0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L,
0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 0L,
0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L,
1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 1L,
0L, 1L, 0L, 1L, 1L, 1L)), row.names = c(NA, -199L), class = "data.frame")
bravo <- structure(list(b1 = c(0L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L,
0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L,
0L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L,
0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L,
1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), b2 = c(0L,
1L, 0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L,
1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L,
1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 1L, 0L,
0L, 0L, 1L, 1L, 1L, 0L), b3 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
1L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L,
0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L,
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L,
1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L),
b4 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L,
1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L,
1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L,
1L, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 1L, 1L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L,
0L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L,
0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 1L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-199L))
Your data is encoded as 0 and 1. arules needs this encoded as TRUE and FALSE, so using > .5 is the right step. Also, the rules in the data have very low confidence, so you need to change the default of .8. Here is code to create rules for alpha. You can use similar code for bravo.
> # create transactions and make sure they look OK
> tr_alpha <- as(alpha > .5, "transactions")
> summary(tr_alpha)
transactions as itemMatrix in sparse format with
199 rows (elements/itemsets/transactions) and
4 columns (items) and a density of 0.4773869
most frequent items:
a4 a2 a1 a3 (Other)
122 115 91 52 0
element (itemset/transaction) length distribution:
sizes
0 1 2 3 4
10 56 82 44 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 2.00 1.91 3.00 4.00
includes extended item information - examples:
labels
1 a1
2 a2
3 a3
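For contrast, here is a rough base-R analogy for what appears to happen when the 0/1 columns are left numeric. This is an assumption based on the a1=[0,1] labels in the question's output, not a reproduction of arules internals: if a numeric column is binned and its whole range fits in one interval, every row ends up with the same "item". The vector `x` is a hypothetical toy column.

```r
# Hypothetical 0/1 column (not taken from alpha)
x <- c(1L, 1L, 0L, 0L, 1L)

# Binning into the single interval [0, 1]: both 0 and 1 land in it,
# so all rows share one level and the "item" has support 1
binned <- cut(x, breaks = c(0, 1), include.lowest = TRUE)
print(levels(binned))
print(table(binned))

# A logical column instead records presence/absence per row,
# which is what lets rules have support below 1
present <- x > 0.5
print(sum(present))
```

Under that reading, an item present in all 199 rows would explain rules whose support, confidence, and lift are all 1.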
> # mine rules with a reduced confidence threshold
> rules <- apriori(tr_alpha, support = 0.1, confidence = .5)
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
0.5 0.1 1 none FALSE TRUE 5 0.1 1 10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 19
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4 item(s), 199 transaction(s)] done [0.00s].
sorting and recoding items ... [4 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [11 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].
> inspect(rules)
lhs rhs support confidence coverage lift count
[1] {} => {a2} 0.5778894 0.5778894 1.0000000 1.0000000 115
[2] {} => {a4} 0.6130653 0.6130653 1.0000000 1.0000000 122
[3] {a3} => {a1} 0.1306533 0.5000000 0.2613065 1.0934066 26
[4] {a3} => {a2} 0.1407035 0.5384615 0.2613065 0.9317726 28
[5] {a3} => {a4} 0.1457286 0.5576923 0.2613065 0.9096784 29
[6] {a1} => {a2} 0.2412060 0.5274725 0.4572864 0.9127568 48
[7] {a1} => {a4} 0.3015075 0.6593407 0.4572864 1.0754819 60
[8] {a2} => {a4} 0.3266332 0.5652174 0.5778894 0.9219530 65
[9] {a4} => {a2} 0.3266332 0.5327869 0.6130653 0.9219530 65
[10] {a1, a2} => {a4} 0.1507538 0.6250000 0.2412060 1.0194672 30
[11] {a1, a4} => {a2} 0.1507538 0.5000000 0.3015075 0.8652174 30
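As a sanity check on how to read these columns, the metrics for rule [7] {a1} => {a4} can be recomputed by hand from counts that already appear in this thread: 199 transactions, column totals of 91 for a1 and 122 for a4, and a rule count of 60. The variable names are my own; the formulas are the standard definitions of support, confidence, coverage, and lift.

```r
n       <- 199  # number of transactions
n_a1    <- 91   # rows containing a1 (column total)
n_a4    <- 122  # rows containing a4 (column total)
n_a1_a4 <- 60   # rows containing both (count of rule [7])

support    <- n_a1_a4 / n              # P(a1 and a4)
confidence <- n_a1_a4 / n_a1           # P(a4 | a1)
coverage   <- n_a1 / n                 # P(a1), the support of the LHS
lift       <- confidence / (n_a4 / n)  # confidence relative to P(a4)

print(round(c(support = support, confidence = confidence,
              coverage = coverage, lift = lift), 7))
```

A lift this close to 1 means a1 tells you little about a4, which is consistent with none of these rules reaching the default confidence of .8.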