I am new to Weka and have two-class data to classify. I was able to classify it using different term weightings (word occurrences, TF-IDF, or word presence). I wanted to improve the accuracy of the classifier using the feature selection mechanism integrated in Weka, as follows:
BufferedReader trainReader = new BufferedReader(new FileReader(dataSource));
trainInsts = new Instances(trainReader);
trainInsts.setClassIndex(trainInsts.numAttributes() - 1);
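// I am using the filter to convert the data from string to numeric
StringToWordVector STWfilter = new StringToWordVector();
FilteredClassifier model = new FilteredClassifier();
model.setFilter(STWfilter); // attach the filter to the classifier
int n = 400; // number of features to select
AttributeSelection attributeSelection = new AttributeSelection();
Ranker ranker = new Ranker();
ranker.setNumToSelect(n); // keep only the n top-ranked attributes
InfoGainAttributeEval infoGainAttributeEval = new InfoGainAttributeEval();
attributeSelection.setEvaluator(infoGainAttributeEval); // score attributes by information gain
attributeSelection.setSearch(ranker);
attributeSelection.setInputFormat(trainInsts); // required before Filter.useFilter
trainInsts = Filter.useFilter(trainInsts, attributeSelection);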
Evaluation eval = new Evaluation(trainInsts);
eval.crossValidateModel(model, trainInsts, folds, new Random(1));
This works, and I see slight improvements over the standard weighting methods (such as word occurrence). I am not sure whether what I did is correct, because I feel that the feature selection method is the same as the weighting methods. Also, must I specify the number of features "n" to keep? It influences the classifier's results significantly, so how should it be set? For example, when I have 3000 instances, how many features should I select? Also, is there any way in Weka to obtain the number of features (words) in my data? For example, with 2000 instances, the best accuracy was with n=400.
Any comments?
Thanks in advance
Answering your questions one by one:
Setting the minimum threshold to 0.0 means that all features scoring more than 0.0 will be kept, as they provide at least a bit of predictive information. You can raise that threshold up to 1.0 in the case of Information Gain; the higher the threshold, the fewer features you will keep. Additionally, a rule of thumb that has been used in the literature on text classification (see e.g. the Yang & Pedersen paper) is to keep around 1-10% of the features. In Information Retrieval, Salton stated that terms with a Document Frequency of 1 to 10% of the number of documents were more discriminant (but Information Retrieval is about search, which is not supervised).
So, summarizing: you are doing it right. Keep on with attribute selection, but for simplicity, state 0.0 as the minimum threshold for Information Gain.
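To make the effect of the threshold concrete, here is a minimal plain-Java sketch (no Weka dependency) that computes Information Gain for binary word-presence features in a two-class problem and keeps the features scoring strictly above a threshold. The class name `InfoGainSketch`, its method names, and the document counts are all hypothetical and chosen only for illustration; this is not Weka's implementation, just the standard entropy-based definition.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: information gain of word-presence features, two classes. */
final class InfoGainSketch {

    /** Entropy in bits of a two-outcome distribution given by counts a and b. */
    static double entropy(int a, int b) {
        int total = a + b;
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int count : new int[] {a, b}) {
            if (count > 0) {
                double p = (double) count / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /**
     * Information gain of a word about the class.
     * posWith / negWith: documents of each class that contain the word.
     * posWithout / negWithout: documents of each class that do not.
     */
    static double infoGain(int posWith, int negWith, int posWithout, int negWithout) {
        int with = posWith + negWith;
        int without = posWithout + negWithout;
        int total = with + without;
        double classEntropy = entropy(posWith + posWithout, negWith + negWithout);
        double conditional = ((double) with / total) * entropy(posWith, negWith)
                + ((double) without / total) * entropy(posWithout, negWithout);
        return classEntropy - conditional;
    }

    /** Indices of the features whose score is strictly above the threshold. */
    static List<Integer> selectAbove(double[] scores, double threshold) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > threshold) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up counts for three words over 100 documents (50 per class).
        double[] scores = {
            infoGain(50, 0, 0, 50),   // word occurs in exactly one class
            infoGain(25, 25, 25, 25), // word is independent of the class
            infoGain(40, 10, 10, 40)  // word is skewed towards one class
        };
        System.out.println(selectAbove(scores, 0.0)); // prints [0, 2]
        System.out.println(selectAbove(scores, 0.5)); // prints [0]
    }
}
```

With a threshold of 0.0 only the word that carries no information about the class is dropped; raising the threshold to 0.5 also drops the merely skewed word. In Weka, the corresponding setting should be the `Ranker` search's `setThreshold` method, used with `setNumToSelect(-1)` so that no fixed count is imposed, rather than a hard-coded `n`.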