I have a dataset on which I used the information gain feature selection method in WEKA to find the important features. Below is the output I got.
Ranked attributes:
0.97095 1 Opponent
0.41997 11 Field_Goals_Made
0.38534 24 Opp_Free_Throws_Made
0.00485 4 Home
0 8 Field_Goals_Att
0 12 Opp_Total_Rebounds
0 10 Def_Rebounds
0 9 Total_Rebounds
0 6 Opp_Field_Goals_Made
0 7 Off_Rebounds
0 14 Opp_3Pt_Field_Goals_Made
0 2 Fouls
0 3 Opp_Blocks
0 5 Opp_Fouls
0 13 Opp_3Pt_Field_Goals_Att
0 29 3Pt_Field_Goal_Pct
0 28 3Pt_Field_Goals_Made
0 22 3Pt_Field_Goals_Att
0 25 Free_Throws_Made
This tells me that all features with a score of 0 can be ignored. Is that correct?
Now, when I tried the wrapper subset evaluation in WEKA, it selected attributes that were ignored by the info gain method (i.e., attributes whose score was 0). Below is the output:
Selected attributes: 3,8,9,11,24,25 : 6
Opp_Blocks
Field_Goals_Att
Total_Rebounds
Field_Goals_Made
Opp_Free_Throws_Made
Free_Throws_Made
I want to understand: why are attributes that are ignored by info gain considered strongly by the wrapper subset evaluation method?
To understand what's happening, it helps to understand first what the two feature selection methods are doing.
The information gain of an attribute tells you how much information about the classification target the attribute gives you. That is, it measures the difference in information between the case where you know the value of the attribute and the case where you don't. A common measure of information is Shannon entropy, although any measure that allows you to quantify the information content of a message will do.
So the information gain depends on two things: how much information was available before knowing the attribute value, and how much was available after. For example, if your data contains only one class, you already know what the class is without having seen any attribute values and the information gain will always be 0. If, on the other hand, you have no information to start with (because the classes you want to predict are represented in equal quantities in your data), and an attribute splits the data perfectly into the classes, its information gain will be 1.
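Both of the cases just described can be made concrete with a small sketch. This is not WEKA's implementation, just a minimal pure-Python computation of information gain using Shannon entropy, with hypothetical `win`/`loss` labels and a `home`/`away` attribute for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Entropy of the labels minus the weighted entropy of the
    label subsets obtained by splitting on the attribute's values."""
    n = len(labels)
    after = 0.0
    for v in set(attr_values):
        subset = [y for x, y in zip(attr_values, labels) if x == v]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after

# Balanced classes, and the attribute splits them perfectly -> gain is 1 bit
attr = ["home", "home", "away", "away"]
print(information_gain(attr, ["win", "win", "loss", "loss"]))  # 1.0

# Only one class -> nothing left to learn, gain is 0
print(information_gain(attr, ["win"] * 4))  # 0.0
```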
The important thing to note in this context is that information gain is a purely information-theoretic measure; it does not consider any actual classification algorithm.
This is what the wrapper method does differently. Instead of analyzing the attributes and targets from an information-theoretic point of view, it uses an actual classification algorithm to build a model with a subset of the attributes and then evaluates the performance of this model. It then tries a different subset of attributes and does the same thing again. The subset for which the trained model exhibits the best empirical performance wins.
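The search-and-evaluate loop can be sketched as follows. This is not how WEKA's WrapperSubsetEval is implemented (it wraps a configurable classifier and typically uses cross-validation with a greedy search rather than exhaustive search); it is just a toy wrapper that scores every feature subset by the leave-one-out accuracy of a 1-nearest-neighbour classifier:

```python
from itertools import combinations

def one_nn_predict(train_X, train_y, x, feats):
    """1-nearest-neighbour prediction using only the features in `feats`."""
    dist = lambda a: sum((a[f] - x[f]) ** 2 for f in feats)
    return min(zip(train_X, train_y), key=lambda p: dist(p[0]))[1]

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of 1-NN on the feature subset `feats`."""
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += one_nn_predict(rest_X, rest_y, X[i], feats) == y[i]
    return hits / len(X)

def wrapper_select(X, y, n_features):
    """Exhaustively score every non-empty feature subset; keep the best."""
    subsets = (s for k in range(1, n_features + 1)
                 for s in combinations(range(n_features), k))
    return max(subsets, key=lambda s: loo_accuracy(X, y, s))

# Feature 0 predicts y perfectly; feature 1 is noise.
X = [(0, 1), (0, 0), (1, 1), (1, 0)]
y = [0, 0, 1, 1]
print(wrapper_select(X, y, 2))  # (0,)
```

The key point is visible in the code: the score of a subset comes from the trained model's empirical performance, not from any property of the attributes themselves.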
There are a number of reasons why the two methods can give you different results (this list is not exhaustive):

- Attributes may only be informative in combination: `a` and `b` may carry no information separately, but `a*b`, on the other hand, might. Information gain will not discover this because it evaluates attributes in isolation, while a classification algorithm may be able to leverage the combination.
- An attribute may be redundant: even though `b` provides information on its own, it may provide no information in addition to `a`, which is used higher up in the tree. Therefore `b` would appear useful when evaluated according to information gain, but is not used by a tree that "knows" `a` first.

In practice it's usually a better idea to use a wrapper for attribute selection, as it takes the performance of the actual classifier you want to use into account, and different classifiers vary widely in how they use information. The advantage of classifier-agnostic measures like information gain is that they are much cheaper to compute.