To handle new and unseen labels in a Spark ML pipeline, I want to use most-frequent imputation. The pipeline consists of 3 steps.
Assuming that (1), (2, 3) and (4, 5) constitute separate pipelines, I currently apply the following
import org.apache.spark.ml.feature.StringIndexerModel
import org.apache.spark.sql.functions.{lit, when}

val fittedLabels = pipeline23.stages.collect { case model: StringIndexerModel => model }

// keep only values seen during fit in each categorical column; null out the rest
val result = categoricalColumns.zipWithIndex.foldLeft(validationData) {
  case (currentDF, (colName, index)) =>
    currentDF.withColumn(
      colName,
      when(currentDF(colName).isin(fittedLabels(index).labels: _*), currentDF(colName))
        .otherwise(lit(null))
    )
}.drop("replace")
to replace new/unseen labels with null.
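For intuition, the rule the fold applies to each cell can be sketched in plain Scala, without Spark (`maskUnseen` and the sample labels are made-up names for illustration):

```scala
// Keep a value only if the fitted indexer saw it during fit;
// otherwise map it to null (modelled as None here).
def maskUnseen(value: String, fittedLabels: Set[String]): Option[String] =
  if (fittedLabels.contains(value)) Some(value) else None

val seen = Set("red", "green", "blue")
maskUnseen("green", seen)  // kept: Some("green")
maskUnseen("purple", seen) // unseen, nulled out: None
```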
However, this setup is very ugly, and tools like CrossValidator no longer work because I can't supply a single pipeline.
How can I access the fitted labels within the pipeline to build a custom Transformer which handles setting new values to null?
Do you see a better approach to handling new values? I assume most-frequent imputation is acceptable, i.e. for a dataset with around 90 columns, only very few columns will contain an unseen label.
I finally realized that this functionality must reside in the pipeline itself to work properly, i.e. it requires an additional, new PipelineStage component.
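A minimal sketch of such a stage, assuming the per-column label whitelists are obtained separately (the class name `UnseenLabelNuller` is hypothetical; in a full solution this would be an Estimator that collects the labels itself at fit time, so that a single pipeline can be cross-validated):

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, lit, when}
import org.apache.spark.sql.types.StructType

// A pipeline stage that nulls out values not contained in the provided
// per-column label whitelists. Placed before the StringIndexers, it keeps
// them from failing on labels unseen during fit.
class UnseenLabelNuller(override val uid: String,
                        labelsPerColumn: Map[String, Array[String]])
    extends Transformer {

  def this(labelsPerColumn: Map[String, Array[String]]) =
    this(Identifiable.randomUID("unseenLabelNuller"), labelsPerColumn)

  override def transform(dataset: Dataset[_]): DataFrame =
    labelsPerColumn.foldLeft(dataset.toDF) { case (df, (colName, labels)) =>
      df.withColumn(
        colName,
        when(col(colName).isin(labels: _*), col(colName)).otherwise(lit(null)))
    }

  // The schema is unchanged: values are replaced within existing columns.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): UnseenLabelNuller =
    new UnseenLabelNuller(uid, labelsPerColumn)
}
```

Because it is a plain Transformer, it can be inserted as a stage in one Pipeline, which makes CrossValidator usable again.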