I have a DataFrame with multiple categorical columns. I'm trying to compute the chi-squared statistic between two columns using the built-in function:
from pyspark.ml.stat import ChiSquareTest
r = ChiSquareTest.test(df, 'feature1', 'feature2')
However, it gives me the error:
IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'
The data type for feature1 is:
feature1: double (nullable = true)
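For reference, here is a minimal, made-up reproduction of how the DataFrame is set up and how I checked the column type (the real data has more columns and different values):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['feature1', 'feature2'])
df.printSchema()
# root
#  |-- feature1: double (nullable = true)
#  |-- feature2: double (nullable = true)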
Could you please help me with this?
spark-ml is not a typical statistics library; it is very ML-oriented, so it assumes you want to run a test between a label and a feature or a group of features. Therefore, just as when you train a model, you need to assemble the features you want to test against the label.
In your case, you can just assemble feature1 as follows:
from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.feature import VectorAssembler

data = [(1, 2), (3, 4), (2, 1), (4, 3)]
df = spark.createDataFrame(data, ['feature1', 'feature2'])

# Assemble the column(s) you want to test into a single vector column
assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')

ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)
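The result is a one-row DataFrame; if you prefer the raw numbers instead of the printed table, a quick sketch of pulling them out (using the result column names exposed by Spark ML's ChiSquareTest):
# Grab the single result row and read out the test results
r = ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').head()
print('pValues: ' + str(r.pValues))
print('degreesOfFreedom: ' + str(r.degreesOfFreedom))
print('statistics: ' + str(r.statistics))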
Just in case, here is the same code in Scala:
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.ml.feature.VectorAssembler

import spark.implicits._  // needed for toDF outside the spark-shell

val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
  .toDF("feature1", "feature2", "feature3")

// Assemble feature1 into a vector column named "features"
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1"))
  .setOutputCol("features")

ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)