
How to extract the longest apriori rule (association rule)


When I run the following example:

library("arules")
data("Adult")
## Mine association rules.
rules <- apriori(Adult, parameter = list(supp = 0.5, conf = 0.9, target = "rules"))
labels(rules)

the output includes the following rules:

[5] "{sex=Male} => {capital-gain=None}"  
[20] "{race=White,sex=Male} => {capital-gain=None}"
[22] "{sex=Male,native-country=United-States} => {capital-gain=None}" 

These rules have the same RHS but differ in their LHS. I would like to keep only the rules with the longest LHS and omit the shorter ones. In the example above, I would like to omit rule [5], because its LHS ({sex=Male}) is contained in the LHS of both [20] and [22]. In other examples the longest LHS can have 3 or more components.


Solution

  • Use is.subset to get a logical matrix, and use that matrix to locate non-subsets:

    subsets <- is.subset(rules, proper = TRUE)
    subsets[lower.tri(subsets, diag=TRUE)] <- 0 # set lower triangle to 0
    notsubsets <- rowSums(subsets) == 0L
    labels(rules[notsubsets])
    
    
    # [1] "{capital-gain=None,hours-per-week=Full-time} => {capital-loss=None}"                      
    # [2] "{capital-loss=None,hours-per-week=Full-time} => {capital-gain=None}"                      
    # [3] "{race=White,sex=Male} => {capital-gain=None}"                                             
    # [4] "{race=White,sex=Male,native-country=United-States} => {capital-loss=None}"                
    # [5] "{race=White,sex=Male,capital-loss=None} => {native-country=United-States}"                
    # [6] "{sex=Male,capital-loss=None,native-country=United-States} => {race=White}"                
    # [7] "{sex=Male,capital-gain=None,native-country=United-States} => {capital-loss=None}"         
    # [8] "{workclass=Private,race=White,native-country=United-States} => {capital-loss=None}"       
    # [9] "{workclass=Private,race=White,capital-loss=None} => {native-country=United-States}"       
    #[10] "{workclass=Private,race=White,capital-gain=None} => {capital-loss=None}"                  
    #[11] "{workclass=Private,race=White,capital-loss=None} => {capital-gain=None}"                  
    #[12] "{workclass=Private,capital-gain=None,native-country=United-States} => {capital-loss=None}"
    #[13] "{workclass=Private,capital-loss=None,native-country=United-States} => {capital-gain=None}"
    #[14] "{race=White,capital-gain=None,native-country=United-States} => {capital-loss=None}"       
    #[15] "{race=White,capital-loss=None,native-country=United-States} => {capital-gain=None}"       
    #[16] "{race=White,capital-gain=None,capital-loss=None} => {native-country=United-States}"
    

    One issue with this approach: when applied to rules, is.subset compares all items of each rule (LHS and RHS together), not just the LHS. As mentioned in a comment, the above approach therefore misses the rule {sex=Male,native-country=United-States} => {capital-gain=None}, because its items are a proper subset of the items in a longer rule with a different RHS:

    labels(rules[c(22, 43)])
    #[1] "{sex=Male,native-country=United-States} => {capital-gain=None}"                  
    #[2] "{sex=Male,capital-gain=None,native-country=United-States} => {capital-loss=None}"
    is.subset(rules[22], rules[43])
    

    To keep these cases, you can use <= 1L instead of == 0L, but then you'll also get a false positive: {sex=Male,capital-gain=None} => {capital-loss=None} is kept even though it is a subset of {sex=Male,capital-gain=None,native-country=United-States} => {capital-loss=None}.
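
Because the question asks to compare left-hand sides only among rules that share the same right-hand side, one possible workaround is to run is.subset on lhs(rules) and mask the result with an RHS-equality matrix. The following is a minimal sketch and not part of the original answer: the helper name longest_lhs_rules is hypothetical, and as.matrix() is used on the assumption that is.subset may return a sparse matrix in recent arules versions.

```r
library("arules")
data("Adult")

# Same mining call as in the question (verbose output switched off).
rules <- apriori(Adult,
                 parameter = list(supp = 0.5, conf = 0.9, target = "rules"),
                 control = list(verbose = FALSE))

# Hypothetical helper: keep a rule unless another rule with the same RHS
# has an LHS that is a proper superset of this rule's LHS.
longest_lhs_rules <- function(rules) {
  # lhs_sub[i, j] is TRUE when the LHS of rule i is a proper subset
  # of the LHS of rule j (the RHS is ignored here).
  lhs_sub <- as.matrix(is.subset(lhs(rules), lhs(rules), proper = TRUE))

  # same_rhs[i, j] is TRUE when rules i and j have identical RHS labels.
  rhs_lab <- labels(rhs(rules))
  same_rhs <- outer(rhs_lab, rhs_lab, "==")

  # A rule is "covered" if some rule with the same RHS has a longer LHS
  # that contains its LHS.
  covered <- rowSums(lhs_sub & same_rhs) > 0
  rules[!covered]
}

labels(longest_lhs_rules(rules))
```

With this filter, a rule is dropped only when a rule with the same RHS has a strictly larger LHS containing it, so the RHS no longer takes part in the subset test. Building two dense n x n matrices is fine for a few dozen rules but may be too memory-hungry for very large rule sets.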