I have a dataset on which I used the information gain feature selection method in WEKA to find the important features. Below is the output I got.
Ranked attributes:
0.97095 1 Opponent
0.41997 11 Field_Goals_Made
0.38534 24 Opp_Free_Throws_Made
0.00485 4 Home
0 8 Field_Goals_Att
0 12 Opp_Total_Rebounds
0 10 Def_Rebounds
0 9 Total_Rebounds
0 6 Opp_Field_Goals_Made
0 7 Off_Rebounds
0 14 Opp_3Pt_Field_Goals_Made
0 2 Fouls
0 3 Opp_Blocks
0 5 Opp_Fouls
0 13 Opp_3Pt_Field_Goals_Att
0 29 3Pt_Field_Goal_Pct
0 28 3Pt_Field_Goals_Made
0 22 3Pt_Field_Goals_Att
0 25 Free_Throws_Made
This tells me that all features with a score of 0 can be ignored. Is that correct?
Now, when I tried the wrapper subset evaluation in WEKA, it selected attributes that were ignored by the info gain method (i.e., attributes whose score was 0). Below is the output:
Selected attributes: 3,8,9,11,24,25 : 6
Opp_Blocks
Field_Goals_Att
Total_Rebounds
Field_Goals_Made
Opp_Free_Throws_Made
Free_Throws_Made
I want to understand: why are attributes that are ignored by info gain considered strongly by the wrapper subset evaluation method?
To understand what's happening, it helps to understand first what the two feature selection methods are doing.
The information gain of an attribute tells you how much information about the classification target the attribute gives you. That is, it measures the difference in information between the case where you know the value of the attribute and the case where you don't. A common measure of information is Shannon entropy, although any measure that allows you to quantify the information content of a message will do.
So the information gain depends on two things: how much information was available before knowing the attribute value, and how much was available after. For example, if your data contains only one class, you already know what the class is without having seen any attribute values and the information gain will always be 0. If, on the other hand, you have no information to start with (because the classes you want to predict are represented in equal quantities in your data), and an attribute splits the data perfectly into the classes, its information gain will be 1.
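Both of the cases just described can be made concrete with a small sketch. This is not WEKA's implementation, just a minimal pure-Python computation of information gain using Shannon entropy, with hypothetical `win`/`loss` labels and a `home`/`away` attribute for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return sum(-c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    """Entropy of the labels minus the weighted entropy of the
    label subsets obtained by splitting on the attribute's values."""
    n = len(labels)
    after = 0.0
    for v in set(attr_values):
        subset = [y for x, y in zip(attr_values, labels) if x == v]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after

# Balanced classes, and the attribute splits them perfectly -> gain is 1 bit
attr = ["home", "home", "away", "away"]
print(information_gain(attr, ["win", "win", "loss", "loss"]))  # 1.0

# Only one class -> nothing left to learn, gain is 0
print(information_gain(attr, ["win"] * 4))  # 0.0
```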
The important thing to note in this context is that information gain is a purely information-theoretic measure; it does not consider any actual classification algorithm.
This is what the wrapper method does differently. Instead of analyzing the attributes and targets from an information-theoretic point of view, it uses an actual classification algorithm to build a model with a subset of the attributes and then evaluates the performance of this model. It then tries a different subset of attributes and does the same thing again. The subset for which the trained model exhibits the best empirical performance wins.
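The search-and-evaluate loop can be sketched as follows. This is not how WEKA's WrapperSubsetEval is implemented (it wraps a configurable classifier and typically uses cross-validation with a greedy search rather than exhaustive search); it is just a toy wrapper that scores every feature subset by the leave-one-out accuracy of a 1-nearest-neighbour classifier:

```python
from itertools import combinations

def one_nn_predict(train_X, train_y, x, feats):
    """1-nearest-neighbour prediction using only the features in `feats`."""
    dist = lambda a: sum((a[f] - x[f]) ** 2 for f in feats)
    return min(zip(train_X, train_y), key=lambda p: dist(p[0]))[1]

def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of 1-NN on the feature subset `feats`."""
    hits = 0
    for i in range(len(X)):
        rest_X, rest_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        hits += one_nn_predict(rest_X, rest_y, X[i], feats) == y[i]
    return hits / len(X)

def wrapper_select(X, y, n_features):
    """Exhaustively score every non-empty feature subset; keep the best."""
    subsets = (s for k in range(1, n_features + 1)
                 for s in combinations(range(n_features), k))
    return max(subsets, key=lambda s: loo_accuracy(X, y, s))

# Feature 0 predicts y perfectly; feature 1 is noise.
X = [(0, 1), (0, 0), (1, 1), (1, 0)]
y = [0, 0, 1, 1]
print(wrapper_select(X, y, 2))  # (0,)
```

The key point is visible in the code: the score of a subset comes from the trained model's empirical performance, not from any property of the attributes themselves.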
There are a number of reasons why the two methods can give you different results (this list is not exhaustive):

- Attributes may only be informative in combination: `a` and `b` may carry no information separately, but `a*b`, on the other hand, might. Information gain will not discover this because it evaluates attributes in isolation, while a classification algorithm may be able to leverage the combination.
- An attribute may be redundant: even though `b` provides information on its own, it may provide no information in addition to `a`, which is used higher up in the tree. Therefore `b` would appear useful when evaluated according to information gain, but is not used by a tree that "knows" `a` first.

In practice it's usually a better idea to use a wrapper for attribute selection, as it takes the performance of the actual classifier you want to use into account, and different classifiers vary widely in how they use information. The advantage of classifier-agnostic measures like information gain is that they are much cheaper to compute.