Tags: algorithm, data-mining, distributed-system, apriori

Performance of Frequent Itemset mining


I have implemented the Apriori algorithm for mining frequent itemsets. It works fine on sample data, but when I tried to run it on the retail dataset available at http://fimi.ua.ac.be/data/retail.dat (about 3 MB of data, 88k transactions, 1,600 unique items), it takes around 29 hours. I searched for the cause of the performance hit and found that generating the candidate itemsets takes most of the time. Can anybody help me improve the performance, or is this normal behaviour for the algorithm?


Solution

  • What algorithm are you using, and what are your thresholds?

    If you have n frequent 1-itemsets that satisfy your minimum support, Apriori (and many other algorithms) must consider n*(n-1)/2 candidate 2-itemsets. This of course gets rather expensive. In Apriori, the 2-itemset step is often the largest and most expensive one (depending on your settings, the 3-itemset step may be worse, but usually you don't get that far...).

    Depending on your data characteristics, Eclat may perform better - or worse. FP-growth is very clever, but seems to be really tricky to implement right and efficiently.
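
    To make the difference in approach concrete, here is a minimal sketch of the Eclat idea in Python. This is my own illustration for this answer (the function name and all details are made up, not taken from any particular library): items are stored as vertical tid-sets, and itemsets are grown depth-first by intersecting tid-sets, so support counting becomes a set intersection rather than a scan over the transactions.

    from collections import defaultdict

    def eclat(transactions, min_support):
        # Vertical layout: item -> set of ids of transactions containing it.
        tidsets = defaultdict(set)
        for tid, transaction in enumerate(transactions):
            for item in transaction:
                tidsets[item].add(tid)

        frequent = {}

        def grow(prefix, items):
            # items: (item, tidset) pairs, each frequent together with prefix.
            for i, (item, tids) in enumerate(items):
                itemset = prefix + (item,)
                frequent[itemset] = len(tids)
                # Extend only with items later in the ordering; the support
                # of the extension is the size of a plain set intersection.
                extensions = [(other, tids & other_tids)
                              for other, other_tids in items[i + 1:]]
                extensions = [(o, t) for o, t in extensions
                              if len(t) >= min_support]
                if extensions:
                    grow(itemset, extensions)

        initial = sorted(((item, tids) for item, tids in tidsets.items()
                          if len(tids) >= min_support), key=lambda p: p[0])
        grow((), initial)
        return frequent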

    There is also a myriad of variants; some are probabilistic and may be substantially faster.

    But in essence, the implementation and the data set / parameter settings matter. One approach may be better than another on the same data set, and implementations can easily differ by a factor of 10 in performance. This holds in particular for APRIORI, where many people do not fully understand the pruning tricks used and end up doing far too much work (a sketch of the candidate generation with proper pruning follows below).
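
    For reference, here is a sketch of the APRIORI candidate generation with both the join step and the prune step (Python, my own illustration; apriori_gen is a hypothetical name, and the frequent (k-1)-itemsets are assumed to be stored as sorted tuples):

    def apriori_gen(frequent, k):
        # frequent: set of sorted (k-1)-tuples of items.
        candidates = set()
        ordered = sorted(frequent)
        for a in ordered:
            for b in ordered:
                # Join step: merge two (k-1)-itemsets that share their
                # first k-2 items, keeping the result sorted.
                if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                    c = a + (b[k - 2],)
                    # Prune step: every (k-1)-subset of the candidate must
                    # itself be frequent. Skipping this check is a common
                    # way of doing far too much work in the counting phase.
                    if all(c[:i] + c[i + 1:] in frequent for i in range(k)):
                        candidates.add(c)
        return candidates

    The double loop is the naive quadratic join; a tuned implementation exploits the sorted prefixes to avoid comparing all pairs, but the prune step is the part that is most often left out.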

    For your example data set (which unfortunately is entirely unhelpful without a dictionary that explains the numbers), my APRIORI finishes in about 1 minute on a low-end Atom CPU with minSupport=1000. The 4-itemsets found were:

    32, 38, 39, 48: 1236
    32, 39, 41, 48: 1646
    36, 38, 39, 48: 1080
    38, 39, 41, 48: 1991
    38, 39, 48, 110: 1031
    38, 39, 48, 170: 1193
    

    But as mentioned before, the parameters matter a lot with APRIORI. APRIORI scales well in the number of transactions (there may be more of them than fit into main memory), but it suffers from a large number of candidates, so you need to set minSupport high enough.
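
    The scaling in the number of transactions comes from the fact that support counting needs only one streaming pass over the data per level. A rough sketch (Python, my own illustration; it assumes the FIMI file format of one space-separated transaction per line, as in retail.dat):

    from collections import Counter
    from itertools import combinations

    def count_supports(path, candidates, k):
        # One pass over the file; memory usage depends on the number of
        # candidates, not on the number of transactions.
        counts = Counter()
        with open(path) as f:
            for line in f:
                items = sorted(set(map(int, line.split())))
                # Naive: enumerate the k-subsets of the transaction and
                # look each one up. Real implementations use a hash tree
                # or prefix trie to avoid this blow-up on long transactions.
                for subset in combinations(items, k):
                    if subset in candidates:
                        counts[subset] += 1
        return counts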

    With minSupport=1000:

    1-frequentItemsets (56)
    2-frequentItemsets (49)
    3-frequentItemsets (24)
    4-frequentItemsets (6)
    APRIORI runtime: 55401 ms
    

    With minSupport=500:

    1-frequentItemsets (185)
    2-frequentItemsets (191)
    3-frequentItemsets (79)
    4-frequentItemsets (13)
    5-frequentItemsets (0)
    APRIORI runtime: 136594 ms
    

    As you can see, the runtime more than doubled. The reason is that the number of frequent 1-itemsets grew, yielding many more 2-itemset candidates (see the quick check below). At the 3- and 4-itemset levels, the difference is not as big anymore. But overall, 2 minutes of runtime on a really low-end CPU isn't bad yet.
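
    The growth in candidates is easy to verify with the n*(n-1)/2 formula from above, using the reported 1-itemset counts:

    from math import comb
    print(comb(56, 2))   # 1540 candidate 2-itemsets at minSupport=1000
    print(comb(185, 2))  # 17020 candidate 2-itemsets at minSupport=500, ~11x as many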

    Decreasing minimum support to 250:

    1-frequentItemsets (587)
    2-frequentItemsets (640)
    3-frequentItemsets (273)
    4-frequentItemsets (54)
    5-frequentItemsets (4)
    APRIORI runtime: 954984 ms
    

    Now the runtime starts exploding: almost 16 minutes to find the 5-itemsets:

    32, 38, 39, 41, 48: 448
    36, 38, 39, 41, 48: 334
    38, 39, 41, 48, 110: 346
    38, 39, 41, 48, 170: 413
    

    As you can see, the 38, 39, 41, 48 4-itemset plays a key role in this data set (we already found it in the first run). We can now also give the confidence of the rule 38, 39, 41, 48 -> 32, which will be the most confident rule involving 5 items: 448 / 1991, i.e. approximately 22.5% of the transactions involving the first four items also involved item 32 (computed below). But given that, as far as I can tell, the meaning of the numbers in this data set is unknown, we do not know whether this rule is actually interesting in practice, or just a toy exercise.
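
    The confidence is just the ratio of the two support counts reported above, conf(X -> Y) = supp(X ∪ Y) / supp(X):

    # Support counts from the minSupport=250 run above.
    supp_antecedent = 1991  # 38, 39, 41, 48
    supp_rule = 448         # 32, 38, 39, 41, 48
    print(supp_rule / supp_antecedent)  # 0.2250..., i.e. about 22.5%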

    Going down to minSupport=100, the runtime grows further:

    1-frequentItemsets (1857)
    2-frequentItemsets (2785)
    3-frequentItemsets (1475)
    4-frequentItemsets (306)
    5-frequentItemsets (28)
    6-frequentItemsets (0)
    APRIORI runtime: 8256507 ms
    

    i.e. over two hours. Most of this time is spent on the 2-itemsets: there are 1.7 million candidates (1857 * 1856 / 2 = 1,723,296), out of which only 2785 were frequent. For the 3-itemsets, one can roughly estimate that there will be just some thousands of candidates, not millions anymore. I have some plans to further improve the implementation by switching between two code paths depending on the number of candidates. (Update: with a simpler modification, I further reduced the runtime by a factor of 20. So yes, implementation matters; it can easily make a factor of 100 to 1000 or more, for example when APRIORI pruning is not fully implemented.)

    If I ever find time to read up on the Eclat algorithm and reimplement it, I can update this with results. I believe it will be much faster here, and so will FP-growth.