
Computation of Chi-Square Test


I am trying to understand how sklearn's chi2 function computes its result for the following input.

from sklearn.feature_selection import chi2

chi2([[1, 2, 0, 0, 1],
      [0, 0, 1, 0, 0],
      [0, 0, 0, 2, 1]], [True, False, False])

I get the chi2 statistics [2, 4, 0.5, 1, 0.25] (the first element of the returned tuple; the second element contains the p-values).

I have already found the following formula for its computation on Wikipedia (x_i being referred to as the observed values and m_i as the expected values), but I do not know how to apply it.

    chi^2 = sum_i (x_i - m_i)^2 / m_i

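The formula itself seems mechanical to translate (a sketch with my own names, observed and expected):

    def chi_square(observed, expected):
        # Pearson's chi-squared statistic: sum of (x_i - m_i)^2 / m_i
        return sum((x - m) ** 2 / m for x, m in zip(observed, expected))

What I do not see is which values I am supposed to plug in as observed and expected.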
What I understand is that I have three samples of input (rows) and five features (columns), and that the chi2 function returns a score for each feature indicating how strongly it depends on the class. For example, the feature represented by the second column occurs twice in the first class and gets a chi2 value of 4.

What I think I have figured out is that

  1. the columns are treated independently of each other, which makes sense
  2. if I omit the third row, the expected values would be the sums of the columns and the observed values simply the values in the respective cells, except that this does not work for the last column
  3. the two rows labeled False seem to be somehow combined, but I have not yet figured out how.

If anybody can help me out, that would be highly appreciated. Thanks!


Solution

  • I just looked into the sources of scikit-learn, and the calculation is actually fairly straightforward. In my example, we have two classes (True and False). For the second class, we have two samples ([0, 0, 1, 0, 0] and [0, 0, 0, 2, 1]).

    We first sum up the columns for each class, which gives the observed values:

     True: [1, 2, 0, 0, 1]
    False: [0, 0, 1, 2, 1]
    
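    In numpy terms, this per-class summing could look like the following sketch (the names X, y, and observed are mine, not scikit-learn's):

        import numpy as np

        X = np.array([[1, 2, 0, 0, 1],
                      [0, 0, 1, 0, 0],
                      [0, 0, 0, 2, 1]])
        y = np.array([True, False, False])

        # Sum the feature counts over the samples of each class.
        observed = np.array([X[y == c].sum(axis=0) for c in (True, False)])
        # observed[0] -> [1, 2, 0, 0, 1]  (True)
        # observed[1] -> [0, 0, 1, 2, 1]  (False)
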

    To compute the expected values, we sum each column over all classes (i.e., the total count with which the feature was observed), which gives [1, 2, 1, 2, 2]. If we assume there is no correlation between a feature and the class it was found in, these totals must be distributed across the classes in proportion to the number of samples per class. I.e., 1/3 of each total should be found in the True class and 2/3 in the False class, which gives the expected values:

     True: 1/3 * [1, 2, 1, 2, 2] = [1/3, 2/3, 1/3, 2/3, 2/3]
    False: 2/3 * [1, 2, 1, 2, 2] = [2/3, 4/3, 2/3, 4/3, 4/3]
    
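    Continuing the sketch from above, the expected values follow from the column totals and the class proportions (class_prob is again my own name):

        # Total count of each feature over all samples: [1, 2, 1, 2, 2]
        feature_count = X.sum(axis=0)
        # Fraction of samples per class: 1/3 True, 2/3 False
        class_prob = np.array([1 / 3, 2 / 3])
        # One row of expected counts per class
        expected = np.outer(class_prob, feature_count)
        # expected[0] -> [1/3, 2/3, 1/3, 2/3, 2/3]
        # expected[1] -> [2/3, 4/3, 2/3, 4/3, 4/3]
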

    Now chi2 can be computed for each column; as an example, take the most interesting last column:

    (1-2/3)^2 / (2/3) + (1-4/3)^2 / (4/3) = 1/6 + 1/12 = 1/4 = 0.25
    
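    The same computation for all columns at once, using the arrays from the sketches above:

        chi2_stats = ((observed - expected) ** 2 / expected).sum(axis=0)
        # -> [2.0, 4.0, 0.5, 1.0, 0.25]

    This matches the first element of the tuple returned by sklearn.feature_selection.chi2(X, y).
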

    The resulting chi2 value of 0.25 is relatively small; therefore, as one would expect, this feature is independent of the class.