I've run an InfoGain evaluation on my dataset, using a Ranker with a threshold of 0.1.
My output via the GUI says:
Search Method:
Attribute ranking.
Threshold for discarding attributes: 0.1
Attribute Evaluator (supervised, Class (nominal): 23 class):
Information Gain Ranking Filter
Ranked attributes:
0.141 2 nr_visits
Selected attributes: 2 : 1
In my Java implementation, I do the same thing:
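import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;

// Rank attributes and discard those scoring below 0.1
Ranker ranker = new Ranker();
ranker.setGenerateRanking(true);
ranker.setThreshold(0.1);
AttributeSelection attsel = new AttributeSelection();
InfoGainAttributeEval eval = new InfoGainAttributeEval();
attsel.setEvaluator(eval);
attsel.setSearch(ranker);
attsel.SelectAttributes(instances);
// Zero-based indices of the selected attributes
int[] ranked_attr = attsel.selectedAttributes();
// One row per ranked attribute: [index, score]
double[][] rawscores = attsel.rankedAttributes();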
Where I get similar output:
[1, 21]
(with 1 being the nr_visits feature, and 21 another). It has the 1, and then another feature, 21, with a score lower than my threshold. What gives? Are there one or two selected features? Is this a bug in Weka 3.8.4?
Thanks to Eibe on the mailing list:
AFAIK, the set of indices returned by selectedAttributes() includes the index of the class attribute. I assume that attribute 22 in your data is the class attribute. There is no score for the class attribute because it is the attribute that we are trying to predict.
And yes, the 21 was indeed my class index: indices are zero-based in code but one-based in the GUI, which is why I didn't immediately notice.
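If you only want the feature indices, one option is to drop the class index from the returned array yourself. The sketch below is a plain-Java helper and not part of Weka; the class name ClassIndexFilter and the method withoutClassIndex are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the Weka API.
final class ClassIndexFilter {
    private ClassIndexFilter() {}

    // Returns a copy of `selected` with every occurrence of `classIndex` removed.
    // Both the array entries and `classIndex` are assumed to be zero-based.
    static int[] withoutClassIndex(int[] selected, int classIndex) {
        return Arrays.stream(selected)
                     .filter(i -> i != classIndex)
                     .toArray();
    }
}
```

With the code from the question, this would presumably be called as ClassIndexFilter.withoutClassIndex(attsel.selectedAttributes(), instances.classIndex()), which for my data would leave only the index of nr_visits.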