
How do I get the factor-to-index mappings from RFormula/RFormulaModel in Apache Spark?


All,

I have a simple data frame that includes a column dep and a categorical column indep.

I am using the RFormula API to build a model matrix as follows:

import org.apache.spark.ml.feature.RFormula

val formula = "dep ~ indep"
val rF = new RFormula().setFormula(formula).setFeaturesCol("features").setLabelCol("label")
val rfModel = rF.fit(df)

Here rfModel is of type RFormulaModel. According to the docs, the mapping of the categorical variable "indep" should be accessible from this object through pipelineModel, but that appears to be a private member.

My question is: how do I get the labels and their corresponding indices from the RFormulaModel object? I know I can read the metadata of the transformed data frame and do some string manipulation, but is there a more straightforward way?

Thanks for any help!


Solution

  • I came up with a hack: write the RFormulaModel to disk, then read its pipelineModel part back in as a PipelineModel. From there the StringIndexerModel stages are accessible, as shown here:

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.feature.StringIndexerModel
    
    // Persist the fitted RFormulaModel; its internal pipeline is saved under "pipelineModel".
    rfModel.write.overwrite().save("/rfModel")
    val pModel = PipelineModel.load("/rfModel/pipelineModel")
    
    // Keep only the StringIndexerModel stages and pair each input column with its labels.
    // A label's position in the array is the index assigned to it.
    val labelMaps = pModel.stages.collect {
      case indexer: StringIndexerModel => (indexer.getInputCol, indexer.labels)
    }
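
Since the question asks for labels together with their indices, here is a minimal sketch of turning the (input column, labels) pairs into explicit lookups. It relies on StringIndexerModel assigning index i to labels(i). The object name LabelIndexSketch, the helper toIndexMaps, and the sample labels "a", "b", "c" are made up for illustration; the sketch uses plain Scala collections in place of the real labelMaps so it can run without Spark.

```scala
object LabelIndexSketch {
  // Hypothetical helper: for each input column, map every label to its
  // position in the labels array, which is the index StringIndexerModel assigns.
  def toIndexMaps(labelMaps: Seq[(String, Array[String])]): Map[String, Map[String, Int]] =
    labelMaps.map { case (col, labels) =>
      col -> labels.zipWithIndex.toMap
    }.toMap

  def main(args: Array[String]): Unit = {
    // Stand-in for the labelMaps value from the answer; the labels are invented.
    val labelMaps = Seq(("indep", Array("a", "b", "c")))

    val indexMaps = toIndexMaps(labelMaps)

    // Print each column's labels in index order.
    indexMaps.foreach { case (col, mapping) =>
      mapping.toSeq.sortBy(_._2).foreach { case (label, idx) =>
        println(s"$col: $label -> $idx")
      }
    }
  }
}
```

With the real model, passing labelMaps.toSeq to the helper should give the same kind of map for each indexed column.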