Tags: scala, apache-spark, apache-spark-ml

Applying IndexToString to features vector in Spark


Context: I have a data frame where all categorical values have been indexed using StringIndexer.

val categoricalColumns = df.schema.collect { case StructField(name, StringType, nullable, meta) => name }    

val categoryIndexers = categoricalColumns.map {
  col => new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed") 
}

Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).

val assembler = new VectorAssembler()
    .setInputCols(dfIndexed.columns.diff(List("label") ++ categoricalColumns))
    .setOutputCol("features")

After applying the classifier and a few additional steps, I end up with a data frame that has label, features, and prediction. I would like to expand my features vector into separate columns in order to convert the indexed values back to their original String form.

val categoryConverters = categoricalColumns.zip(categoryIndexers).map {
  case (col, indexer) =>
    new IndexToString()
      .setInputCol(s"${col}Indexed")
      .setOutputCol(col)
      .setLabels(indexer.fit(df).labels)
}

Question: Is there a simple way of doing this, or is the best approach to somehow attach the prediction column to the test data frame?

What I have tried:

val featureSlicers = categoricalColumns.map {
  col => new VectorSlicer()
    .setInputCol("features")
    .setOutputCol(s"${col}Indexed")
    .setNames(Array(s"${col}Indexed"))
}

Applying this gives me the columns that I want, but they are in Vector form (which is what VectorSlicer is meant to produce) and not of type Double.

Edit: The desired output is the original data frame (i.e. categorical features as String not index) with an additional column indicating the predicted label (which in my case is 0 or 1).

For example, say the output of my classifier looked something like this:

+-----+---------+----------+
|label| features|prediction|
+-----+---------+----------+
|  1.0|[0.0,3.0]|       1.0|
+-----+---------+----------+

By applying VectorSlicer on each feature I would get:

+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        [0.0]|        [3.0]|
+-----+---------+----------+-------------+-------------+

Which is great, but I need:

+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|         0.0 |         3.0 |
+-----+---------+----------+-------------+-------------+

To then be able to use IndexToString and convert it to:

+-----+---------+----------+-------------+-------------+
|label| features|prediction|    status   |    artist   |
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        good |  Pink Floyd |
+-----+---------+----------+-------------+-------------+

or even:

+-----+----------+-------------+-------------+
|label|prediction|    status   |    artist   |
+-----+----------+-------------+-------------+
|  1.0|       1.0|        good |  Pink Floyd |
+-----+----------+-------------+-------------+

Solution

  • Well, it is not a very useful operation, but it should be possible to extract the required information using column metadata and a simple UDF. I assume your data has been created by a pipeline similar to this one:

    import org.apache.spark.ml.feature.{VectorSlicer, VectorAssembler, StringIndexer}
    import org.apache.spark.ml.Pipeline
    
    val df = sc.parallelize(Seq(
      (1L, "a", "foo", 1.0), (2L, "b", "bar", 2.0), (3L, "a", "bar", 3.0)
    )).toDF("id", "x1", "x2", "x3")
    
    val featureCols = Array("x1", "x2", "x3")
    val featureColsIdx = featureCols.map(c => s"${c}_i")
    
    val indexers = featureCols.map(
      c => new StringIndexer().setInputCol(c).setOutputCol(s"${c}_i")
    )
    
    val assembler = new VectorAssembler()
      .setInputCols(featureColsIdx)
      .setOutputCol("features")
    
    val slicer = new VectorSlicer()
      .setInputCol("features")
      .setOutputCol("string_features")
      .setNames(featureColsIdx.init)
    
    
    val transformed = new Pipeline()
      .setStages(indexers :+ assembler :+ slicer)
      .fit(df)
      .transform(df)
    

    First we can extract desired metadata from the features:

    val meta = transformed.select($"string_features")
      .schema.fields.head.metadata
      .getMetadata("ml_attr") 
      .getMetadata("attrs")
      .getMetadataArray("nominal")
    

    and convert it to something easier to use:

    case class NominalMetadataWrapper(idx: Long, name: String, vals: Array[String])
    
    // In general it could be a good idea to make it a broadcast variable
    val lookup = meta.map(m => NominalMetadataWrapper(
      m.getLong("idx"), m.getString("name"), m.getStringArray("vals")
    ))
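
As an alternative to reading the raw metadata keys, the same nominal attributes can probably be obtained through the typed attribute API in `org.apache.spark.ml.attribute`. The following is only a sketch: `nominalLabels` is a hypothetical helper name, and it assumes the vector column still carries ML attribute metadata (as the `string_features` column produced above does).

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NominalAttribute}
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of Spark): for every nominal slot of a vector
// column, return (slot index, attribute name, labels in index order).
def nominalLabels(df: DataFrame, vectorCol: String): Seq[(Int, String, Array[String])] = {
  // Parse the ML attribute metadata attached to the vector column
  val group = AttributeGroup.fromStructField(df.schema(vectorCol))
  group.attributes.getOrElse(Array.empty[Attribute]).toSeq.collect {
    // Keep only nominal attributes that know both their slot index and their labels
    case n: NominalAttribute if n.index.isDefined && n.values.isDefined =>
      (n.index.get, n.name.getOrElse(""), n.values.get)
  }
}
```

With the pipeline above, `nominalLabels(transformed, "string_features")` should yield one entry per sliced categorical feature, which can then be wrapped in the same way as the metadata array.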
    

    Finally a small UDF:

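    import scala.util.Try
    import org.apache.spark.sql.functions.udf
    // Assumes Spark 2.x or later; on Spark 1.x use org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.ml.linalg.Vector
    
    // For each nominal slot, map the stored index back to its label (None if out of range)
    val transFeatures = udf((v: Vector) => lookup.map{
      m => Try(m.vals(v(m.idx.toInt).toInt)).toOption
    })
    
    transformed.select(transFeatures($"string_features"))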
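
If the goal is the layout from the question (one String column per categorical feature), another option might be to pull a single slot out of the vector as a Double and then hand it to `IndexToString`. This is a sketch under assumptions, not a tested recipe for your pipeline: `vectorElement` and `restoreColumn` are hypothetical names, the temporary column name is arbitrary, Spark 2.x or later is assumed for `org.apache.spark.ml.linalg.Vector`, and the labels are assumed to come from the fitted `StringIndexerModel` of the corresponding column.

```scala
import org.apache.spark.ml.feature.IndexToString
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, udf}

// Extracts one slot of an ML vector as a plain Double
val vectorElement = udf((v: Vector, i: Int) => v(i))

// Hypothetical helper: turns slot `slot` of `vectorCol` back into its original
// String label and stores it in `outputCol`.
def restoreColumn(
    df: DataFrame,
    vectorCol: String,
    slot: Int,
    outputCol: String,
    labels: Array[String]): DataFrame = {
  val tmpCol = s"${outputCol}_idx_tmp" // temporary Double column holding the index
  val withIndex = df.withColumn(tmpCol, vectorElement(col(vectorCol), lit(slot)))
  new IndexToString()
    .setInputCol(tmpCol)
    .setOutputCol(outputCol)
    .setLabels(labels) // explicit labels, so no column metadata is needed
    .transform(withIndex)
    .drop(tmpCol)
}
```

Calling it once per categorical feature, with the slot position the feature had in the `VectorAssembler` input and the labels of its fitted indexer, should give columns such as `status` and `artist` next to `label` and `prediction`.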