I am trying to understand how the chi2 function is computed for the following input.
import sklearn.feature_selection

sklearn.feature_selection.chi2([[1, 2, 0, 0, 1],
                                [0, 0, 1, 0, 0],
                                [0, 0, 0, 2, 1]], [True, False, False])
I get the following result for chi2: [2, 4, 0.5, 1, 0.25]
I already found the following formula for its computation on Wikipedia (with x_i referred to as observed and m_i as expected), but I do not know how to apply it.
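For reference, the formula in question is presumably Pearson's chi-squared statistic. This is an assumption based on the symbols mentioned above, with x_i the observed count, m_i the expected count, and k the number of categories:

$$
X^2 = \sum_{i=1}^{k} \frac{(x_i - m_i)^2}{m_i}
$$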
What I understand is that I have three categories of input (rows) and five features (columns), and the chi2 function returns whether there is a correlation between the feature and the class. The feature represented by the second column occurs twice in the first category and gets a chi2 value of 4.
What I think I have figured out is that the two False rows seem to be somehow combined, but I have not yet figured out how. If anybody can help me out, that would be highly appreciated. Thanks!
I just looked into the sources of scikit-learn, and the calculation is actually fairly straightforward. In my example, we have two classes (True and False). For the second class, we have two samples ([0, 0, 1, 0, 0] and [0, 0, 0, 2, 1]).
We first sum up the columns for each class, which gives the observed values:
True: [1, 2, 0, 0, 1]
False: [0, 0, 1, 2, 1]
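The per-class column sums can be reproduced with a few lines of plain Python. This is only an illustrative sketch, not scikit-learn's actual code, and the helper name observed_counts is made up for this example:

```python
X = [[1, 2, 0, 0, 1],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 2, 1]]
y = [True, False, False]


def observed_counts(X, y):
    """Sum the feature columns separately for each class label (hypothetical helper)."""
    n_features = len(X[0])
    observed = {}
    for row, label in zip(X, y):
        # Start a zero vector the first time a label is seen.
        totals = observed.setdefault(label, [0] * n_features)
        for j, value in enumerate(row):
            totals[j] += value
    return observed


observed = observed_counts(X, y)
print(observed[True])   # [1, 2, 0, 0, 1]
print(observed[False])  # [0, 0, 1, 2, 1]
```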
To compute the expected values, we compute the sum of all columns (i.e., the total count that the feature was observed over all classes), which gives [1, 2, 1, 2, 2]. If we assume there is no correlation between a feature and the class it was found in, these totals must be distributed across the classes in proportion to the number of samples in each class. I.e., 1/3 of the values should be found in the True class and 2/3 in the False class, which gives the expected values:
True: 1/3 * [1, 2, 1, 2, 2] = [1/3, 2/3, 1/3, 2/3, 2/3]
False: 2/3 * [1, 2, 1, 2, 2] = [2/3, 4/3, 2/3, 4/3, 4/3]
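The expected values can be checked the same way. The sketch below uses fractions.Fraction so that the results match the fractions written above exactly; the helper name expected_counts is an assumption for this example and does not come from scikit-learn:

```python
from fractions import Fraction

y = [True, False, False]
feature_totals = [1, 2, 1, 2, 2]  # column sums over all samples


def expected_counts(feature_totals, y):
    """Split each feature total across classes by class frequency (hypothetical helper)."""
    n_samples = len(y)
    expected = {}
    for label in set(y):
        # Share of samples belonging to this class, e.g. 1/3 for True.
        class_share = Fraction(y.count(label), n_samples)
        expected[label] = [class_share * total for total in feature_totals]
    return expected


expected = expected_counts(feature_totals, y)
print([str(e) for e in expected[True]])   # ['1/3', '2/3', '1/3', '2/3', '2/3']
print([str(e) for e in expected[False]])  # ['2/3', '4/3', '2/3', '4/3', '4/3']
```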
Now chi2 can be computed for each column by summing (observed - expected)^2 / expected over both classes. As an example, for the most interesting last column:
(1-2/3)^2 / (2/3) + (1-4/3)^2 / (4/3) = 1/6 + 1/12 = 1/4 = 0.25
The chi2 value of 0.25 is relatively small, so, as one would expect, there is no evidence that this feature depends on the class.
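Putting the steps together, the whole result from the question can be reproduced in plain Python. This is a sketch of the calculation described above, not the scikit-learn implementation. The function name chi2_per_feature is made up, and the code assumes every feature has a nonzero total (otherwise the expected value would be zero and the division would fail):

```python
from fractions import Fraction

X = [[1, 2, 0, 0, 1],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 2, 1]]
y = [True, False, False]


def chi2_per_feature(X, y):
    """Chi-squared statistic per column (hypothetical helper, exact arithmetic)."""
    n_samples = len(y)
    n_features = len(X[0])
    # Total count of each feature over all samples.
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    stats = [Fraction(0)] * n_features
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        class_share = Fraction(len(rows), n_samples)
        for j in range(n_features):
            obs = sum(row[j] for row in rows)        # observed count in this class
            exp = class_share * feature_totals[j]    # expected count under independence
            stats[j] += (obs - exp) ** 2 / exp
    return stats


stats = chi2_per_feature(X, y)
print([float(s) for s in stats])  # [2.0, 4.0, 0.5, 1.0, 0.25]
```

The output matches the values reported in the question.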