apache-spark, pyspark, apache-spark-ml

IllegalArgumentException: Column must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'


I have a dataframe with multiple categorical columns. I'm trying to compute the chi-squared statistic between two columns using the built-in function:

from pyspark.ml.stat import ChiSquareTest

r = ChiSquareTest.test(df, 'feature1', 'feature2')

However, it gives me the error:

IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'

The datatype for feature1 is:

feature1: double (nullable = true)

Could you please help me resolve this?


Solution

  • spark-ml is not a general-purpose statistics library; it is very ML-oriented. It therefore assumes that you will want to run a test between a label column and a feature (or a group of features) packed into a single vector column.

    So, just as when you train a model, you first need to assemble the features you want to test against the label.

    In your case, you can just assemble feature1 as follows:

    from pyspark.ml.stat import ChiSquareTest
    from pyspark.ml.feature import VectorAssembler
    
    data = [(1, 2), (3, 4), (2, 1), (4, 3)]
    df = spark.createDataFrame(data, ['feature1', 'feature2'])
    assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')
    
    ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)
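
    For intuition, what `VectorAssembler` does here can be sketched in plain Python (no Spark): each row's selected input columns are packed, in order, into one vector-valued `features` column, which is the vector type (`struct<type:tinyint,size:int,indices:array<int>,values:array<double>>`) the error message asks for. The function and names below are illustrative, not Spark's implementation:

    ```python
    def assemble(rows, input_cols, output_col="features"):
        """Row-wise sketch of VectorAssembler: copy each row and add a
        list-valued column packing the chosen input columns, in order."""
        out = []
        for row in rows:
            new_row = dict(row)  # keep the original columns
            new_row[output_col] = [float(row[c]) for c in input_cols]
            out.append(new_row)
        return out

    rows = [{"feature1": 1.0, "feature2": 2.0},
            {"feature1": 3.0, "feature2": 4.0}]
    print(assemble(rows, ["feature1"]))
    # each row gains a features=[<feature1 value>] vector column
    ```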
    

    Just in case, here is the equivalent code in Scala:

    import org.apache.spark.ml.stat.ChiSquareTest
    import org.apache.spark.ml.feature.VectorAssembler
    
    val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
        .toDF("feature1", "feature2", "feature3")
    val assembler = new VectorAssembler()
        .setInputCols(Array("feature1"))
        .setOutputCol("features")
    
    ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)
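
    For intuition about what `ChiSquareTest` computes, the Pearson chi-squared statistic between two categorical columns can be sketched in plain Python from their contingency table. This is a minimal illustration of the statistic, not Spark's implementation:

    ```python
    from collections import Counter

    def chi_square_statistic(xs, ys):
        """Pearson chi-squared statistic for two categorical sequences:
        the sum over contingency-table cells of (observed - expected)^2 / expected,
        where expected = row_total * column_total / n under independence."""
        n = len(xs)
        joint = Counter(zip(xs, ys))   # observed count per (x, y) cell
        x_totals = Counter(xs)         # marginal counts for x
        y_totals = Counter(ys)         # marginal counts for y
        stat = 0.0
        for x, nx in x_totals.items():
            for y, ny in y_totals.items():
                observed = joint.get((x, y), 0)
                expected = nx * ny / n
                stat += (observed - expected) ** 2 / expected
        return stat

    # Perfectly dependent columns give a large statistic
    print(chi_square_statistic([1, 1, 2, 2], ["a", "a", "b", "b"]))  # 4.0
    # Perfectly independent columns give 0
    print(chi_square_statistic([1, 1, 2, 2], ["a", "b", "a", "b"]))  # 0.0
    ```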