Tags: apache-spark, data-mining, apache-spark-mllib, market-basket-analysis

Avoiding unexpected results from Association Rules


I'm trying to extract some association rules from this dataset:

49
70
27,66
6
27
66,8,64
32
82
66
71
44
1
33
17
31,83
50,29
22
72
8
8,16
56
83,61
85,63,37
50,57
2
50
96,6
73
57
12
62
96
3
47,50,73
35
85,45
25,96,22,17
85
24
17,57
34,4
60,96,45
25
85,66,73
30
14
73,85
64
48
5
37
13,55
37,17

I have this code:

val transactions = sc.textFile("/user/cloudera/dataset1")

import org.apache.spark.mllib.fpm.AssociationRules
import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset

val freqItemsets = transactions.flatMap(xs => 
    (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).map(x => (x.toList, 1L))
  ).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}

val ar = new AssociationRules().setMinConfidence(0.4)

val results = ar.run(freqItemsets)

results.collect().foreach { rule =>
  println("[" + rule.antecedent.mkString(",")
    + "=>"
    + rule.consequent.mkString(",") + "]," + rule.confidence)
}

But I'm getting some unexpected lines in my output:

[2,9=>5],0.5
[8,5,,,3=>6],1.0
[8,5,,,3=>7],0.5
[8,5,,,3=>7],0.5
[,,,=>6],0.5
[,,,=>7],0.5
[,,,=>5],0.5
[,,,=>3],0.5
[4,3=>7],1.0
[4,3=>,,,],1.0
[4,3=>,,,],1.0
[4,3=>5],1.0
[4,3=>7,7],1.0
[4,3=>7,7],1.0
[4,3=>0],1.0

Why am I getting output like this:

[,,,=>3],0.5

I don't understand the issue... Does anyone know how to solve this problem?

Many thanks!


Solution

  • All of these results should be unexpected, because you have a bug in your code!

    You need to create combinations of the items. As it stands, your code creates combinations of the characters in each line (such as "25,96,22,17"), which of course won't give the right result (and that's why you see "," appearing as an element).
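
    You can see the difference in the Scala REPL. A minimal sketch of the two behaviours (the printed pairs are illustrative):

    val line = "25,96,22,17"

    // Calling combinations on a String iterates over its characters,
    // so the comma is treated as just another element:
    line.combinations(2).take(3).foreach(c => println(c.mkString("|")))
    // prints pairs such as 2|2, 2|5, 2|,

    // Splitting first gives combinations of the intended items:
    line.split(",").combinations(2).foreach(c => println(c.mkString("|")))
    // prints 25|96, 25|22, 25|17, 96|22, 96|17, 22|17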

    To fix it, split each line into its items first with transactions.map(_.split(",")), so the combinations are built from item strings rather than characters.

    So instead of

    val freqItemsets = transactions.flatMap(xs => 
        (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5)).map(x => (x.toList, 1L))
      ).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
    

    You should have:

    val freqItemsets = transactions.map(_.split(",")).flatMap(xs =>
        (xs.combinations(1) ++ xs.combinations(2) ++ xs.combinations(3) ++ xs.combinations(4) ++ xs.combinations(5))
          .filter(_.nonEmpty).map(x => (x.toList, 1L))
      ).reduceByKey(_ + _).map{case (xs, cnt) => new FreqItemset(xs.toArray, cnt)}
    

    Which gives the expected output:

    [96,17=>22],1.0
    [96,17=>25],1.0
    [85,37=>63],1.0
    [47,73=>50],1.0
    [31=>83],1.0
    [60,45=>96],1.0
    [60=>45],1.0
    [60=>96],1.0
    [96,45=>60],1.0
    [22,17=>25],1.0
    [22,17=>96],1.0
    [66,8=>64],1.0
    [63,37=>85],1.0
    [66,64=>8],1.0
    [25,22,17=>96],1.0
    [27=>66],0.5
    [96,22,17=>25],1.0
    [61=>83],1.0
    [64=>66],0.5
    [64=>8],0.5
    [45=>60],0.5
    [45=>96],0.5
    [45=>85],0.5
    [6=>96],0.5
    [47=>73],1.0
    [47=>50],1.0
    [50,73=>47],1.0
    [96,22=>17],1.0
    [96,22=>25],1.0
    [66,73=>85],1.0
    [8,64=>66],1.0
    [29=>50],1.0
    [83=>31],0.5
    [83=>61],0.5
    [25,96,17=>22],1.0
    [85,66=>73],1.0
    [25,96,22=>17],1.0
    [25,96=>17],1.0
    [25,96=>22],1.0
    [22=>17],0.5
    [22=>96],0.5
    [22=>25],0.5
    [85,73=>66],1.0
    [55=>13],1.0
    [60,96=>45],1.0
    [63=>37],1.0
    [63=>85],1.0
    [25,22=>17],1.0
    [25,22=>96],1.0
    [16=>8],1.0
    [25=>96],0.5
    [25=>22],0.5
    [25=>17],0.5
    [34=>4],1.0
    [85,63=>37],1.0
    [47,50=>73],1.0
    [13=>55],1.0
    [4=>34],1.0
    [25,17=>22],1.0
    [25,17=>96],1.0
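
    As an aside, MLlib's FPGrowth can mine the frequent itemsets for you, so the hand-rolled combinations(1) through combinations(5) enumeration is not needed at all. A minimal sketch against the same input path (the 0.01 minimum support is an assumed threshold; tune it for your data):

    import org.apache.spark.mllib.fpm.FPGrowth

    // Split each line into its items, as above.
    val transactions = sc.textFile("/user/cloudera/dataset1").map(_.split(","))

    // FPGrowth computes the frequent itemsets itself.
    val model = new FPGrowth()
      .setMinSupport(0.01) // assumed threshold
      .run(transactions)

    // Derive association rules at the same 0.4 confidence as before.
    model.generateAssociationRules(0.4).collect().foreach { rule =>
      println("[" + rule.antecedent.mkString(",") + "=>"
        + rule.consequent.mkString(",") + "]," + rule.confidence)
    }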