Search code examples
orange

Orange3 Frequent Itemsets performance


I just realized that the performance of Frequent Itemsets is strongly correlated with number of item per basket. I run the following code:

import datetime
from datetime import datetime
from orangecontrib.associate.fpgrowth import * 
%%time
T = [[1,3, 4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21]]    
itemsets = frequent_itemsets(T, 1)
a=list(itemsets)

As I increased the number of item in T the running time increased as follow:

Item    running time
21        3.39s
22        9.14s
23        15.8s 
24        37.4s
25        1.2 min
26        10 min
27        35 min
28        95 min

For 31 item sets it took 10 hours without returning any results. I am wondering if there is anyway to run it for more than 31 items in reasonable time? In this case I just need pairwise item set (A-->B) while my understanding is frequent_itemsets count all possible combination and that is probably why it is running time is highly correlated with number of items. Is there any way to tell the method to limit the number item to count like instead of all combination just pairwise?


Solution

  • You could use other software that allows to specify constraints on the itemsets such as a length constraints. For example, you can consider the SPMF data mining library (disclosure: I am the founder), which offers about 120 algorithms for itemset and pattern mining. It will let you use FPGrowth with a length constraint. So you could for example mine only the patterns with 2 items or 3 items. You could also try other features such as mining association rules. That software works on text file and can be called for the command line, and is quite fast.