
Why does StandardScaler not attach metadata to the output column?


I noticed that the ml StandardScaler does not attach metadata to the output column:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val df = spark.read.option("header", true)
  .option("inferSchema", true)
  .csv("/path/to/cars.data")

val strId1 = new StringIndexer()
  .setInputCol("v7")
  .setOutputCol("v7_IDX")
val strId2 = new StringIndexer()
  .setInputCol("v8")
  .setOutputCol("v8_IDX")

val assembleFeatures: VectorAssembler = new VectorAssembler()
  .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7_IDX"))
  .setOutputCol("featuresRaw")

val scaler = new StandardScaler()
  .setInputCol("featuresRaw")
  .setOutputCol("scaledFeatures")

val plm = new Pipeline()
  .setStages(Array(strId1, strId2, assembleFeatures, scaler))
  .fit(df)

val dft = plm.transform(df)

dft.schema("scaledFeatures").metadata

Gives:

res1: org.apache.spark.sql.types.Metadata = {}

This example works on this dataset (just adapt path in code above).

Is there a specific reason for this? Is it likely that this will be supported in a future Spark release? Any suggestions for a workaround that does not involve duplicating the StandardScaler?


Solution

  • While discarding metadata is probably not the most fortunate choice, scaling indexed categorical features doesn't make sense: the values produced by the StringIndexer are just arbitrary label indices, not measurements.

    If you want to scale numerical features, it should be a separate stage:

    val numericAssembler: VectorAssembler = new VectorAssembler()
      .setInputCols(Array("v0", "v1", "v2", "v3", "v4", "v5", "v6"))
      .setOutputCol("numericFeatures")
    
    val scaler = new StandardScaler()
      .setInputCol("numericFeatures")
      .setOutputCol("scaledNumericFeatures")
    
    val finalAssembler: VectorAssembler = new VectorAssembler() 
      .setInputCols(Array("scaledNumericFeatures", "v7_IDX"))
      .setOutputCol("features")
    
    new Pipeline()
      .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
      .fit(df)
    
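    To confirm that this layout preserves metadata, you can fit the pipeline and inspect the final column. A minimal sketch (assumes `df` and the stages above are in scope; `plm2` and `out` are hypothetical names introduced here):

```scala
// Fit the pipeline in which only numeric features are scaled,
// then the scaled vector and the index column are reassembled.
val plm2 = new Pipeline()
  .setStages(Array(strId1, strId2, numericAssembler, scaler, finalAssembler))
  .fit(df)

val out = plm2.transform(df)

// VectorAssembler attaches ML attribute metadata to its output column,
// so "features" carries metadata even though "scaledNumericFeatures" does not.
println(out.schema("features").metadata)
```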

    Keeping in mind the concerns raised at the beginning of this answer, you can also try copying the metadata over from the unscaled column:

    // Note: the $"..." column syntax requires `import spark.implicits._`.
    val result = plm.transform(df).transform(df =>
      df.withColumn(
        "scaledFeatures",
        $"scaledFeatures".as(
          "scaledFeatures",
          df.schema("featuresRaw").metadata)))

    result.schema("scaledFeatures").metadata
    
    {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"v0"},{"idx":1,"name":"v1"},{"idx":2,"name":"v2"},{"idx":3,"name":"v3"},{"idx":4,"name":"v4"},{"idx":5,"name":"v5"},{"idx":6,"name":"v6"}],"nominal":[{"vals":["ford","chevrolet","plymouth","dodge","amc","toyota","datsun","vw","buick","pontiac","honda","mazda","mercury","oldsmobile","peugeot","fiat","audi","chrysler","volvo","opel","subaru","saab","mercedes","renault","cadillac","bmw","triumph","hi","capri","nissan"],"idx":7,"name":"v7_IDX"}]},"num_attrs":8}}
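    If you'd rather work with this metadata programmatically than parse the raw JSON, Spark exposes it through `org.apache.spark.ml.attribute.AttributeGroup`. A sketch, assuming `result` is the DataFrame produced by the snippet above:

```scala
import org.apache.spark.ml.attribute.AttributeGroup

// Decode the ml_attr metadata attached to the vector column's StructField.
val group = AttributeGroup.fromStructField(result.schema("scaledFeatures"))

// Number of entries in the vector (8 in this example: 7 numeric + 1 nominal).
println(group.size)

// Per-feature attributes (name, type), if individual attributes are defined.
group.attributes.foreach(_.foreach(attr =>
  println(s"${attr.index.getOrElse(-1)}: ${attr.name.getOrElse("?")}")))
```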