All,
I have a simple data frame like the one below.
I am using the RFormula API to build a model matrix, as below:
val formula = "dep ~ indep"
val rF = new RFormula().setFormula(formula).setFeaturesCol("features").setLabelCol("label")
val rfModel = rF.fit(df)
Here rfModel is of type RFormulaModel. According to the docs, the mapping of the categorical variable "indep" should be accessible from this object through pipelineModel, but that appears to be a private member.
My question is: how do I get the labels and their corresponding indices from the RFormulaModel object? I know I can read the metadata of the transformed data frame and do string manipulation, but is there a more straightforward way?
Thanks for any help!
I came up with a hack: write the RFormulaModel to disk, then read its pipelineModel part back in as a PipelineModel. That gives access to the StringIndexerModel stages, as shown here:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.StringIndexerModel
rfModel.write.overwrite.save("/rfModel")
val pModel = PipelineModel.read.load("/rfModel/pipelineModel")
// Keep only the StringIndexerModel stages and pair each input column with its labels.
// A label's position in the array is the index it was assigned.
val labelMaps = pModel.stages.collect { case m: StringIndexerModel => (m.getInputCol, m.labels) }
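For comparison, the metadata route mentioned in the question avoids the round trip through disk. The following is only a rough sketch under assumptions: the local SparkSession and the toy data frame are hypothetical stand-ins (the original data is not shown), and the names it prints are one-hot feature names (typically the column name joined to the level, with one level dropped as the reference category), not the raw StringIndexer labels, so some string handling is still needed.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup}
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

// Hypothetical local session and toy data, for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rformula-metadata")
  .getOrCreate()
import spark.implicits._

val df = Seq((1.0, "a"), (0.0, "b"), (1.0, "c"), (0.0, "a")).toDF("dep", "indep")

val rfModel = new RFormula()
  .setFormula("dep ~ indep")
  .setFeaturesCol("features")
  .setLabelCol("label")
  .fit(df)

// The schema of the transformed data frame carries ML attribute metadata
// for the assembled features vector.
val transformed = rfModel.transform(df)
val attrGroup = AttributeGroup.fromStructField(transformed.schema("features"))

// Pair each feature slot index with its attribute name, skipping slots
// that have no index or no name recorded.
val featureNames: Array[(Int, String)] = attrGroup.attributes
  .getOrElse(Array.empty[Attribute])
  .map(a => (a.index, a.name))
  .collect { case (Some(i), Some(n)) => (i, n) }

featureNames.foreach { case (i, n) => println(s"$i -> $n") }
```

This only describes the columns of the features vector. If the full label-to-index mapping is needed, including the dropped reference level, the StringIndexerModel approach above is still the one that exposes it.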