Tags: java, scala, apache-spark, java-io, lightgbm

Scala java.io toArray error when zipping feature importance vector to column names array


When trying to zip the feature importance vector from LightGBM's getFeatureImportances to the column names array, I ran into the error below:

import com.microsoft.ml.spark.LightGBMClassificationModel
import org.apache.spark.ml.classification.RandomForestClassificationModel

def getFeatureImportances(inputContainer: PipelineModelContainer): (String, String) = {
    val transformer = inputContainer.pipelineModel.stages.last

    val featureImportancesVector = inputContainer.params match {
        case RandomForestParameters(numTrees, treeDepth, featureTransformer) =>
            transformer.asInstanceOf[RandomForestClassificationModel].featureImportances
        case LightGBMParameters(treeDepth, numLeaves, iterations, featureTransformer) => 
            transformer.asInstanceOf[LightGBMClassificationModel].getFeatureImportances("split")
    }

    val colNames = inputContainer.featureColNames
    val sortedFeatures = (colNames zip featureImportancesVector.toArray).sortWith(_._2 > _._2).zipWithIndex
}

I am getting this error on the last line of my code:

value toArray is not a member of java.io.Serializable

It seems the LightGBM feature importances cannot be converted to an array. This code works fine if it's just the RandomForestClassifier feature importances. What else can I do?


Solution

  • In the two branches of the match block, one returns an Array[Double] (the LightGBM branch) and the other returns a Vector (the RandomForest branch).

    The closest common supertype of those two types is java.io.Serializable, so Scala infers that as the type of featureImportancesVector. toArray is not a member of that type, even though the method exists on both of the concrete types.

    The fix is easy: as suggested in the comments, move the .toArray inside the match, onto featureImportances, so that both branches, and therefore the variable, have type Array[Double], as shown in the sketches below.
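
To see why the compiler lands on java.io.Serializable, here is a minimal, self-contained sketch (the useVector flag is made up purely for illustration): both org.apache.spark.ml.linalg.Vector and Array[Double] are serializable, so the least upper bound of the two branch types, and hence the inferred type of the expression, is java.io.Serializable.

import org.apache.spark.ml.linalg.Vectors

val useVector = true

// The two branches produce different types (Vector vs. Array[Double]),
// so the compiler infers their common supertype, java.io.Serializable.
val importances = if (useVector) Vectors.dense(0.3, 0.7) else Array(0.3, 0.7)

// importances.toArray  // does not compile: value toArray is not a member of java.io.Serializable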
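
Applied to the code in the question, the corrected match could look like the following sketch (it assumes the same PipelineModelContainer, RandomForestParameters and LightGBMParameters definitions as in the question):

val featureImportancesVector: Array[Double] = inputContainer.params match {
    case RandomForestParameters(numTrees, treeDepth, featureTransformer) =>
        // featureImportances returns a Vector; convert it here so that
        // both branches evaluate to Array[Double]
        transformer.asInstanceOf[RandomForestClassificationModel].featureImportances.toArray
    case LightGBMParameters(treeDepth, numLeaves, iterations, featureTransformer) =>
        // getFeatureImportances("split") already returns Array[Double]
        transformer.asInstanceOf[LightGBMClassificationModel].getFeatureImportances("split")
}

val colNames = inputContainer.featureColNames
// featureImportancesVector is already an Array[Double], so no .toArray here
val sortedFeatures = (colNames zip featureImportancesVector).sortWith(_._2 > _._2).zipWithIndex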