scala apache-spark machine-learning artificial-intelligence

Exception while trying to explain a model with MMLSpark's Scala LIME library

I am trying to explain the predictions made by my XGBoost model using MMLSpark's LIME package for Scala.

This is my first time using the LIME library. The fit operation on the dataset succeeds, but when I call transform, the program stops with an exception:

Caused by: java.lang.ClassCastException: org.apache.spark.ml.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.DenseVector

I have around 200 features, and many of them have zero as their value.

Solution

You are likely using VectorAssembler to create your feature vector column. To save memory, its transform function outputs a sparse vector when a row contains many zeros. MMLSpark's LIME casts the feature vector to DenseVector, so a SparseVector in that column triggers the ClassCastException.

More info on VectorAssembler output: see "Spark ML VectorAssembler returns strange output".
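To see the sparse/dense switch in isolation, here is a minimal sketch that uses only org.apache.spark.ml.linalg. As far as I know, VectorAssembler calls `compressed` on each assembled row, which picks whichever representation needs less storage; the sketch calls `compressed` directly to imitate that. The object and value names (`SparseVsDense`, `mostlyZeros`, `noZeros`, `densify`) are made up for the example.

```scala
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector, Vectors}

object SparseVsDense {
  // Ten values, only one non-zero: `compressed` chooses the sparse layout.
  val mostlyZeros: Vector =
    Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0).compressed

  // Three values, all non-zero: `compressed` keeps the dense layout.
  val noZeros: Vector = Vectors.dense(1.0, 2.0, 3.0).compressed

  // Safe conversion: toDense works on either representation.
  def densify(v: Vector): DenseVector = v.toDense

  def main(args: Array[String]): Unit = {
    // Print the runtime class chosen for each vector.
    println(mostlyZeros.getClass.getSimpleName)
    println(noZeros.getClass.getSimpleName)

    // A blind cast like the one in the stack trace fails on the sparse vector:
    // mostlyZeros.asInstanceOf[DenseVector] // throws ClassCastException

    // Converting first avoids the problem.
    println(densify(mostlyZeros))
  }
}
```

With around 200 features and many zeros, most rows would likely end up sparse, which matches the exception in the question.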

The solution is to convert the column back to a dense vector so that MMLSpark's LIME can interpret it.

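import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.ml.linalg.Vector

// UDF that converts any ML vector (sparse or dense) to a DenseVector
val asDense = udf((v: Vector) => v.toDense)

// DataFrames are immutable, so keep the DataFrame that withColumn returns
val denseFeaturesDF = featuresDF.withColumn("features", asDense(col("features")))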

Then fit your LIME model and run transform on the DataFrame returned by withColumn, not on the original one.
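If you want to confirm that the conversion took effect before handing the data to LIME, something like the following sketch may help. It assumes a local SparkSession and a toy six-column dataset invented for the example; `DenseFeaturesCheck`, `toDenseColumn`, and `allDense` are hypothetical helper names, not part of Spark or MMLSpark.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{DenseVector, Vector}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object DenseFeaturesCheck {
  // Same UDF as in the answer: converts any ML vector to a DenseVector.
  private val asDense = udf((v: Vector) => v.toDense)

  // Returns a new DataFrame in which `column` holds only dense vectors.
  def toDenseColumn(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, asDense(col(column)))

  // Collects the column to the driver and checks each row's vector type.
  // Only suitable for small data or a sample.
  def allDense(df: DataFrame, column: String): Boolean =
    df.select(column).collect().forall(_.get(0).isInstanceOf[DenseVector])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dense-features-check")
      .getOrCreate()
    import spark.implicits._

    // Toy data (an assumption for the example): one mostly-zero row, one full row.
    val raw = Seq(
      (0.0, 0.0, 0.0, 0.0, 0.0, 1.0),
      (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    ).toDF("f1", "f2", "f3", "f4", "f5", "f6")

    val featuresDF = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3", "f4", "f5", "f6"))
      .setOutputCol("features")
      .transform(raw)

    val denseDF = toDenseColumn(featuresDF, "features")

    println(s"dense before conversion: ${allDense(featuresDF, "features")}")
    println(s"dense after conversion:  ${allDense(denseDF, "features")}")

    spark.stop()
  }
}
```

On a real dataset, run the check on a small sample (for example `df.limit(100)`) rather than collecting everything to the driver.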