apache-spark, pyspark, apache-spark-ml

IllegalArgumentException: Column must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'


I have a dataframe with multiple categorical columns. I'm trying to compute the chi-squared statistic between two columns using the built-in function:

from pyspark.ml.stat import ChiSquareTest

r = ChiSquareTest.test(df, 'feature1', 'feature2')

However, it gives me the error:

IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'

The datatype for feature1 is:

feature1: double (nullable = true)

Could you please help me resolve this?


Solution

  • spark-ml is not a general-purpose statistics library; it is very ML-oriented. It therefore assumes that you will want to run a test between a label column and a feature (or a group of features) packed into a single vector column.

    So, just as when you train a model, you first need to assemble the features you want to test against the label.

    In your case, you can just assemble feature1 as follows:

    from pyspark.ml.stat import ChiSquareTest
    from pyspark.ml.feature import VectorAssembler
    
    data = [(1, 2), (3, 4), (2, 1), (4, 3)]
    df = spark.createDataFrame(data, ['feature1', 'feature2'])
    assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')
    
    ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)
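
    For intuition, what `VectorAssembler` does here can be sketched in plain Python (no Spark): each row's selected input columns are packed, in order, into one vector-valued `features` column, which is the vector type (`struct<type:tinyint,size:int,indices:array<int>,values:array<double>>`) the error message asks for. The function and names below are illustrative, not Spark's implementation:

    ```python
    def assemble(rows, input_cols, output_col="features"):
        """Row-wise sketch of VectorAssembler: copy each row and add a
        list-valued column packing the chosen input columns, in order."""
        out = []
        for row in rows:
            new_row = dict(row)  # keep the original columns
            new_row[output_col] = [float(row[c]) for c in input_cols]
            out.append(new_row)
        return out

    rows = [{"feature1": 1.0, "feature2": 2.0},
            {"feature1": 3.0, "feature2": 4.0}]
    print(assemble(rows, ["feature1"]))
    # each row gains a features=[<feature1 value>] vector column
    ```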
    

    Just in case, here is the equivalent code in Scala:

    import org.apache.spark.ml.stat.ChiSquareTest
    import org.apache.spark.ml.feature.VectorAssembler
    
    val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
        .toDF("feature1", "feature2", "feature3")
    val assembler = new VectorAssembler()
        .setInputCols(Array("feature1"))
        .setOutputCol("features")
    
    ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)
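
    For intuition about what `ChiSquareTest` computes, the Pearson chi-squared statistic between two categorical columns can be sketched in plain Python from their contingency table. This is a minimal illustration of the statistic, not Spark's implementation:

    ```python
    from collections import Counter

    def chi_square_statistic(xs, ys):
        """Pearson chi-squared statistic for two categorical sequences:
        the sum over contingency-table cells of (observed - expected)^2 / expected,
        where expected = row_total * column_total / n under independence."""
        n = len(xs)
        joint = Counter(zip(xs, ys))   # observed count per (x, y) cell
        x_totals = Counter(xs)         # marginal counts for x
        y_totals = Counter(ys)         # marginal counts for y
        stat = 0.0
        for x, nx in x_totals.items():
            for y, ny in y_totals.items():
                observed = joint.get((x, y), 0)
                expected = nx * ny / n
                stat += (observed - expected) ** 2 / expected
        return stat

    # Perfectly dependent columns give a large statistic
    print(chi_square_statistic([1, 1, 2, 2], ["a", "a", "b", "b"]))  # 4.0
    # Perfectly independent columns give 0
    print(chi_square_statistic([1, 1, 2, 2], ["a", "b", "a", "b"]))  # 0.0
    ```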