Tags: scala, apache-spark-ml

SparkML - Creating a df(feature, feature_importance) of a RandomForestRegressionModel


I am training a Random Forest model in the following way:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

//Indexer
val stringIndexers = categoricalColumns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Idx")
    .setHandleInvalid("keep")
    .fit(training)
}

//One-hot encoder
val encoders = featuresEnconding.map { colName =>
  new OneHotEncoderEstimator()
    .setInputCols(Array(colName + "Idx"))
    .setOutputCols(Array(colName + "Enc"))
    .setHandleInvalid("keep")
}  

//Adding features into a feature vector column   
val assembler = new VectorAssembler()
              .setInputCols(featureColumns)
              .setOutputCol("features")


val rf = new RandomForestRegressor()
              .setLabelCol("label")
              .setFeaturesCol("features")

val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)

val pipelineRF = new Pipeline()
                 .setStages(stepsRF)


val paramGridRF = new ParamGridBuilder()
                  .addGrid(rf.maxBins, Array(800))
                  .addGrid(rf.featureSubsetStrategy, Array("all"))
                  .addGrid(rf.minInfoGain, Array(0.05))
                  .addGrid(rf.minInstancesPerNode, Array(1))
                  .addGrid(rf.maxDepth, Array(28,29,30))
                  .addGrid(rf.numTrees, Array(20))
                  .build()


//Defining the evaluator (RMSE by default)
val evaluatorRF = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

//Using cross-validation to train the model
//(started with TrainValidationSplit; full cross-validation has been taking very long)
val cvRF = new CrossValidator()
  .setEstimator(pipelineRF)
  .setEvaluator(evaluatorRF)
  .setEstimatorParamMaps(paramGridRF)
  .setNumFolds(10)
  .setParallelism(3)

//Fitting the model with our training dataset
val cvRFModel = cvRF.fit(training)

What I would like now is to get the importance of each of the features in the model after the training.

I am able to get the importance of each feature as an Array[Double] like this:

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]

val size = bestModel.stages.size-1

val ftrImp = bestModel.stages(size).asInstanceOf[RandomForestRegressionModel].featureImportances.toArray

But this only gives me each importance value with a numerical index; I don't know which feature name in my model corresponds to each importance value.

I should also mention that, since I am using a one-hot encoder, the final number of features is much larger than the original featureColumns array.

How can I extract the features names used during the training of my model?


Solution

  • I found this possible solution:

    import org.apache.spark.ml.attribute._
    import org.apache.spark.sql.functions.desc
    import spark.implicits._ // spark is the active SparkSession

    val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]

    val lstModel = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel]

    // The transformed features column carries ML attribute metadata
    // with the expanded (post-one-hot-encoding) feature names
    val predictions = bestModel.transform(training)
    val schema = predictions.schema

    val featureAttrs = AttributeGroup.fromStructField(schema(lstModel.getFeaturesCol)).attributes.get
    val mfeatures = featureAttrs.map(_.name.get)

    val ftrImp = lstModel.featureImportances.toArray

    val mdf = sc.parallelize(mfeatures zip ftrImp).toDF("featureName", "Importance")
      .orderBy(desc("Importance"))
    display(mdf) // display() is Databricks-specific; use mdf.show(false) elsewhere
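The core of the solution is just pairing each expanded feature name with its importance and sorting descending. A minimal plain-Scala sketch of that step (with made-up feature names and importances, no Spark required; the names `ageEnc_0`, `ageEnc_1`, `income` are hypothetical) looks like this:

```scala
object FeatureImportanceDemo {
  def main(args: Array[String]): Unit = {
    // Hypothetical expanded feature names (e.g. after one-hot encoding) and importances
    val mfeatures = Array("ageEnc_0", "ageEnc_1", "income")
    val ftrImp = Array(0.15, 0.05, 0.80)

    // Pair each name with its importance and sort by importance, descending,
    // mirroring what the DataFrame version does with orderBy(desc("Importance"))
    val ranked = (mfeatures zip ftrImp).sortBy(-_._2)

    ranked.foreach { case (name, imp) => println(f"$name%-10s $imp%.2f") }
  }
}
```

The zip is positional, which is why it matters that the names come from the `features` column's attribute metadata: that metadata lists the expanded columns in the same order the VectorAssembler produced them, matching the indices of `featureImportances`.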