weka text-mining apriori

Apriori in WEKA


I'm new to data mining and the WEKA tool.

For my academic project I have to work with bug reports, which are stored in SQL Server. I took the bug summary attribute and applied tokenization, stop-word removal, and stemming.

The stemmed words of each summary are stored in the database, separated by semicolons. Now I have to apply a frequent pattern mining algorithm and find the frequent itemsets using WEKA. My ARFF file looks like this:

@relation ItemSets

@attribute bugid integer
@attribute summary string

@data
755113,enhanc;keep;log;recommend;share
759414,access;review;social
763806,allow;intrus;less;provid;shrunken;sidebar;social;specifi
767221,datacloneerror;deeper;dig;framework;jsm
771353,document;integr;provid;secur;social
785540,avail;determin;featur;method;provid;social;whether
785591,chat;dock;horizont;nest;overlap;scrollbar
787767,abus;api;implement;perform;runtim;warn;worker

After opening the file in WEKA, I am unable to start the process under the Associate tab of the WEKA Explorer: with Apriori selected, the Start button is disabled.

How can I find frequent itemsets on the summary attribute using WEKA? Any help will be appreciated. Thanks in advance!


Solution

  • Apriori is not available for your file because Weka's Apriori only accepts nominal attributes, while your file declares bugid as integer and summary as string. What sort of rules are you trying to find? Could you give an example of the rules you want to obtain?

    values_you_want_to_be_the_antecedent_part_of_your_rule ==> values_you_want_to_be_the_consequent_part_of_your_rule
    

    Changing your attributes to nominal like this

    @relation ItemSets
    
    @attribute bugid {755113, 759414, 763806}
    @attribute summary {'enhanc;keep;log;recommend;share', 'access;review;social', 'allow;intrus;less;provid;shrunken;sidebar;social;specifi'}
    
    @data
    755113,'enhanc;keep;log;recommend;share'
    759414,'access;review;social'
    763806,'allow;intrus;less;provid;shrunken;sidebar;social;specifi'
    

    will only give you rules like

    bugid=755113 1 ==> summary=enhanc;keep;log;recommend;share 1    <conf:(1)> lift:(3) lev:(0.22)
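
To see where the numbers in that rule come from, the following is a minimal sketch that recomputes them from the three rows of the nominal file. It assumes the usual definitions (confidence = count of premise and consequence together / count of premise, lift = confidence / relative frequency of the consequence, leverage = P(A and B) - P(A)P(B)). The helper name `rule_metrics` is made up for this illustration and is not part of Weka.

```python
from fractions import Fraction

# The three rows of the nominal ARFF file, as (bugid, summary) pairs.
rows = [
    ("755113", "enhanc;keep;log;recommend;share"),
    ("759414", "access;review;social"),
    ("763806", "allow;intrus;less;provid;shrunken;sidebar;social;specifi"),
]

def rule_metrics(rows, bugid, summary):
    """Hypothetical helper: metrics for the rule bugid=... ==> summary=..."""
    n = len(rows)
    premise = sum(1 for b, _ in rows if b == bugid)
    consequent = sum(1 for _, s in rows if s == summary)
    both = sum(1 for b, s in rows if b == bugid and s == summary)
    confidence = Fraction(both, premise)
    lift = confidence / Fraction(consequent, n)
    leverage = Fraction(both, n) - Fraction(premise, n) * Fraction(consequent, n)
    return confidence, lift, leverage

conf, lift, lev = rule_metrics(rows, *rows[0])
print(conf, lift, round(float(lev), 2))  # 1 3 0.22
```

Because every bugid and every full summary string occurs exactly once, each such rule trivially has confidence 1, which is why these rules carry no useful information about the words themselves.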
    

    If you are looking for frequent itemsets among the summary words, the bugid is irrelevant and you can remove it from your file. Apriori is used to obtain association rules, e.g. "enhanc, keep ==> log" with support X and confidence Y. To find frequent itemsets, you need to restructure your data so that each summary word is an attribute with values true/false or true/missing; see this question.

    Try the following file in Weka. Select Associate, choose Apriori, and double-click on the white input field next to the Choose button. There, set outputItemSets to true. In the console output, you will see all frequent itemsets and all obtained rules with sufficient support.

    @relation ItemSets
    
    @attribute enhanc {true}
    @attribute keep {true}
    @attribute log {true}
    @attribute recommend {true}
    @attribute share {true}
    @attribute access {true}
    @attribute review {true}
    @attribute social {true}
    @attribute allow {true}
    @attribute intrus {true}
    @attribute less {true}
    @attribute provid {true}
    @attribute shrunken {true}
    @attribute sidebar {true}
    @attribute specifi {true}
    
    
    @data
    true,true,true,true,true,?,?,?,?,?,?,?,?,?,?
    ?,?,?,?,?,true,true,true,?,?,?,?,?,?,?
    ?,?,?,?,?,?,?,true,true,true,true,true,true,true,true
    

    The question marks (?) represent missing values.
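
Writing such a file by hand does not scale beyond a few bug reports, so here is a minimal sketch of generating it from the semicolon-separated summaries. The function name `summaries_to_arff` is hypothetical, and the sketch assumes the stemmed words contain no spaces, commas, quotes, or braces (otherwise they would need ARFF quoting).

```python
def summaries_to_arff(summaries, relation="ItemSets"):
    """Build ARFF text with one {true} attribute per distinct word."""
    # Split each summary into its words, dropping empty fragments.
    transactions = [[w.strip() for w in s.split(";") if w.strip()]
                    for s in summaries]

    # Collect the vocabulary in order of first appearance (stable columns).
    vocabulary, seen = [], set()
    for words in transactions:
        for w in words:
            if w not in seen:
                seen.add(w)
                vocabulary.append(w)

    lines = ["@relation " + relation, ""]
    lines += ["@attribute {} {{true}}".format(w) for w in vocabulary]
    lines += ["", "@data"]
    for words in transactions:
        present = set(words)
        # "true" if the word occurs in this summary, "?" (missing) otherwise.
        lines.append(",".join("true" if w in present else "?"
                              for w in vocabulary))
    return "\n".join(lines) + "\n"


summaries = [
    "enhanc;keep;log;recommend;share",
    "access;review;social",
    "allow;intrus;less;provid;shrunken;sidebar;social;specifi",
]
arff_text = summaries_to_arff(summaries)
print(arff_text)
```

For the three summaries above this reproduces the attribute list and data rows shown in the file. In practice the `summaries` list would be read from the database column instead of being typed in, and the returned text written to a `.arff` file.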