I'm trying to reproduce a basic example from https://www.mathsisfun.com/data/chi-square-test.html
("Which pet do you prefer?"), where the p-value is 0.043.
Instead I get an array of pValues: [0.157299207050285,0.157299207050285], which I don't understand.
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest
data = [(0.0, Vectors.dense(207, 282)),
(1.0, Vectors.dense(231, 242))]
df = spark.createDataFrame(data, ["label", "features"])
r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))
0.0 is male and 1.0 is female
What am I doing wrong?
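PySpark's ChiSquareTest expects the input data in a different format: one row per observation, with the label and each feature holding a categorical value rather than a precomputed count.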
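To see where the 0.157 in the question comes from, here is a plain-Python sketch (no Spark involved). It assumes, as I read the documentation, that ChiSquareTest treats every distinct value in a feature column as a category, builds a contingency table of label against feature value, and runs Pearson's test on it. The helper name `chi_square` is made up for illustration, and the `erfc` shortcut for the p-value only holds for 1 degree of freedom.

```python
from collections import Counter
from math import erfc, sqrt

def chi_square(pairs):
    """Pearson chi-square statistic computed from (label, feature_value) pairs."""
    counts = Counter(pairs)
    labels = sorted({label for label, _ in pairs})
    values = sorted({value for _, value in pairs})
    n = len(pairs)
    stat = 0.0
    for label in labels:
        row_total = sum(counts[(label, v)] for v in values)
        for value in values:
            col_total = sum(counts[(k, value)] for k in labels)
            expected = row_total * col_total / n
            stat += (counts[(label, value)] - expected) ** 2 / expected
    return stat

# The question's two rows, looking only at the first vector component:
# 207.0 and 231.0 are treated as two categories that each occur once.
pairs = [(0.0, 207.0), (1.0, 231.0)]
stat = chi_square(pairs)
p_value = erfc(sqrt(stat / 2))  # chi-square survival function, 1 degree of freedom only
print(stat, p_value)
```

With only two rows, each count is seen as its own category occurring once, so every component of the vector yields the same degenerate 2x2 table. That would explain why both entries of pValues are identical.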
Assuming each feature category is encoded as a number (2.0 and 4.0 in the code below) and each (label, feature) combination occurs with its observed frequency, you need to rewrite the input DataFrame as:
# One row per observation: each (label, feature) pair is repeated by its observed count.
data = (
    [(0.0, Vectors.dense(2.0)) for _ in range(207)]
    + [(0.0, Vectors.dense(4.0)) for _ in range(231)]
    + [(1.0, Vectors.dense(2.0)) for _ in range(282)]
    + [(1.0, Vectors.dense(4.0)) for _ in range(242)]
)
df = spark.createDataFrame(data, ["label", "features"])
df.show()
# +-----+--------+
# |label|features|
# +-----+--------+
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# | 0.0| [2.0]|
# +-----+--------+
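If you have more than a handful of cells, the expansion can be driven from a table of counts instead of hand-written list comprehensions. A minimal sketch in plain Python; the `counts` dict and the `expand_counts` helper are hypothetical names, and the numbers are the ones used above.

```python
from collections import Counter

# (label, feature value) -> observed count
counts = {
    (0.0, 2.0): 207,
    (0.0, 4.0): 231,
    (1.0, 2.0): 282,
    (1.0, 4.0): 242,
}

def expand_counts(counts):
    """Turn a table of counts into one (label, value) tuple per observation."""
    rows = []
    for (label, value), n in counts.items():
        rows.extend([(label, value)] * n)
    return rows

rows = expand_counts(counts)
print(len(rows))  # 962, the total number of observations
```

For Spark you would still wrap each feature value, e.g. `(label, Vectors.dense(value))`, before calling `spark.createDataFrame`.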
If you then run ChiSquareTest, you will see the expected result.
r = ChiSquareTest.test(df, "features", "label")
r.show(truncate=False)
# +---------------------+----------------+-------------------+
# |pValues |degreesOfFreedom|statistics |
# +---------------------+----------------+-------------------+
# |[0.04279386669738339]|[1] |[4.103526475356584]|
# +---------------------+----------------+-------------------+
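As a sanity check, the same numbers can be reproduced by hand without Spark. This sketch uses the shortcut formula for a 2x2 table, chi2 = N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The variable names are mine, and the `erfc` expression for the p-value is only valid for 1 degree of freedom.

```python
from math import erfc, sqrt

# Observed counts from the "Which pet do you prefer?" table
a, b = 207, 282
c, d = 231, 242

n = a + b + c + d
# Shortcut form of Pearson's chi-square statistic for a 2x2 table
stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
# Survival function of the chi-square distribution with 1 degree of freedom
p_value = erfc(sqrt(stat / 2))

print(stat, p_value)
```

Transposing the table (swapping which variable is the label and which is the feature) leaves the statistic unchanged, so the result does not depend on that choice.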