
How to properly use the ChiSquareTest function in Pyspark?


I'm working through the basic "Which pet do you prefer?" example from https://www.mathsisfun.com/data/chi-square-test.html, where the expected p-value is 0.043.

Instead I get an array of pValues, [0.157299207050285,0.157299207050285], which I don't understand.

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

data = [(0.0, Vectors.dense(207, 282)),
        (1.0, Vectors.dense(231, 242))]
df = spark.createDataFrame(data, ["label", "features"])

r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))

Label 0.0 is male and 1.0 is female; each features vector holds that group's cat and dog counts.

What am I doing wrong?


Solution

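  • PySpark's ChiSquareTest expects one row per observation, with each feature holding a categorical value, rather than a table of counts. With the two-row input from the question, every count is treated as a category that occurs exactly once, which is why both p-values come out as the same 0.157.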
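
    To see where 0.157 comes from, here is a minimal pure-Python sketch (not Spark's implementation) that applies Pearson's chi-square test, without continuity correction, to the two rows from the question. The helper name chi_square_for_feature is hypothetical, and the p-value shortcut via math.erfc only holds for one degree of freedom.

```python
import math
from collections import Counter

# The two rows from the question: (label, features).
rows = [(0.0, [207.0, 282.0]),
        (1.0, [231.0, 242.0])]

def chi_square_for_feature(rows, index):
    """Pearson chi-square (no continuity correction) of one feature vs. the label."""
    # Count how often each (label, feature value) pair occurs.
    observed = Counter((label, feats[index]) for label, feats in rows)
    label_totals = Counter(label for label, _ in rows)
    value_totals = Counter(feats[index] for _, feats in rows)
    n = len(rows)
    stat = 0.0
    for label, l_tot in label_totals.items():
        for value, v_tot in value_totals.items():
            expected = l_tot * v_tot / n
            stat += (observed[(label, value)] - expected) ** 2 / expected
    dof = (len(label_totals) - 1) * (len(value_totals) - 1)
    return stat, dof

for i in range(2):
    stat, dof = chi_square_for_feature(rows, i)
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    print(i, stat, dof, p_value)
```

    Each feature has two distinct values that each appear once, so the contingency table is [[1, 0], [0, 1]] both times. That gives a statistic of 2.0 with one degree of freedom and a p-value of about 0.1573, regardless of the actual counts.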

    Assume the following encoding, with Cat/Dog as the label and Men/Women as the feature value:

    • Cat = 0.0
    • Dog = 1.0
    • Men = 2.0
    • Women = 4.0

    And the following observed frequencies:

    • freq(Cat, Men) = 207
    • freq(Cat, Women) = 231
    • freq(Dog, Men) = 282
    • freq(Dog, Women) = 242

    You then need to build the input DataFrame with one row per observation:

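    from pyspark.ml.linalg import Vectors
    from pyspark.ml.stat import ChiSquareTest
    # One row per respondent: label = pet (0.0 Cat, 1.0 Dog), feature = gender (2.0 Men, 4.0 Women)
    data = ([(0.0, Vectors.dense(2.0)) for _ in range(207)] + [(0.0, Vectors.dense(4.0)) for _ in range(231)]
            + [(1.0, Vectors.dense(2.0)) for _ in range(282)] + [(1.0, Vectors.dense(4.0)) for _ in range(242)])
    df = spark.createDataFrame(data, ["label", "features"])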
    
    df.show()
    
    # +-----+--------+
    # |label|features|
    # +-----+--------+
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # |  0.0|   [2.0]|
    # +-----+--------+
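
    The same expansion can be written generically from a table of counts. This is a pure-Python sketch: the counts dictionary and the rows name are assumptions for illustration, and in PySpark each feature list would still need to be wrapped in Vectors.dense before calling spark.createDataFrame.

```python
# Hypothetical contingency table: (label, feature value) -> observed count.
counts = {
    (0.0, 2.0): 207,  # Cat, Men
    (0.0, 4.0): 231,  # Cat, Women
    (1.0, 2.0): 282,  # Dog, Men
    (1.0, 4.0): 242,  # Dog, Women
}

# Emit one (label, [feature]) tuple per observation.
rows = [(label, [value])
        for (label, value), count in counts.items()
        for _ in range(count)]

print(len(rows))  # 962, the total number of respondents
```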
    

    If you then run ChiSquareTest on this DataFrame, the p-value is about 0.043, matching the expected result.

    r = ChiSquareTest.test(df, "features", "label")
    
    r.show(truncate=False)
    
    # +---------------------+----------------+-------------------+
    # |pValues              |degreesOfFreedom|statistics         |
    # +---------------------+----------------+-------------------+
    # |[0.04279386669738339]|[1]             |[4.103526475356584]|
    # +---------------------+----------------+-------------------+
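
    As a cross-check without Spark, the statistic for a 2x2 table can be computed by hand with the closed-form Pearson formula n(ad - bc)^2 / (product of the row and column totals). This is a sketch using only the standard library; the variable names are illustrative, and the math.erfc shortcut for the p-value applies only because there is one degree of freedom.

```python
import math

# Observed counts from the frequency list above.
cat_men, cat_women = 207, 231
dog_men, dog_women = 282, 242

n = cat_men + cat_women + dog_men + dog_women
cat_total = cat_men + cat_women
dog_total = dog_men + dog_women
men_total = cat_men + dog_men
women_total = cat_women + dog_women

# Pearson chi-square for a 2x2 table, no continuity correction.
statistic = (n * (cat_men * dog_women - cat_women * dog_men) ** 2
             / (cat_total * dog_total * men_total * women_total))

# Survival function of the chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(statistic / 2))

print(statistic, p_value)
```

    The values agree with the statistics and pValues columns shown above (about 4.1035 and 0.0428).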