
Spark Pipeline - How to extract attributes from trained feature transformers


I need to extract attributes from trained transformers so that I can use them later for serving, such as the bin boundaries from a QuantileDiscretizer or the name-to-index map from a StringIndexer. For example, how can I extract the bin boundaries from "discretizer_trained" in the code below? I could not find any guidance by searching online or in the official documentation: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.QuantileDiscretizer

//https://spark.apache.org/docs/latest/ml-features.html#quantilediscretizer
import org.apache.spark.ml.feature.QuantileDiscretizer

val data = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val df = spark.createDataFrame(data).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)

val discretizer_trained = discretizer.fit(df)

Solution

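  • In Scala Spark, fitting a QuantileDiscretizer returns a Bucketizer, and the bin boundaries are available through its getSplits method. Running:

      discretizer_trained.getSplits

    in your example will produce: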

      res1: Array[Double] = Array(-Infinity, 5.0, 18.0, Infinity)
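
Once the splits are extracted, they can be used for serving without Spark. The following is a minimal sketch in plain Scala, not Spark's own implementation. It assumes the half-open bucket convention [lower, upper), with the last bucket also including its upper bound, and it does not handle NaN. The names ServingLookup, bucketOf, labels, and labelToIndex are hypothetical. The labels array is made-up example data standing in for what a fitted StringIndexerModel exposes (labelsArray(0) in Spark 3.x, labels in older versions).

```scala
object ServingLookup {
  // Splits as printed by discretizer_trained.getSplits in the example above.
  val splits: Array[Double] =
    Array(Double.NegativeInfinity, 5.0, 18.0, Double.PositiveInfinity)

  // Returns the index of the bucket containing `value`.
  // Assumption: bucket i covers [splits(i), splits(i + 1)), and the last
  // bucket also includes its upper bound.
  def bucketOf(splits: Array[Double], value: Double): Int = {
    require(splits.length >= 2, "need at least two split points")
    require(!value.isNaN, "NaN handling is left out of this sketch")
    if (value == splits.last) splits.length - 2
    else {
      // Last split point that is <= value marks the lower bound of its bucket.
      val idx = splits.lastIndexWhere(_ <= value)
      require(idx >= 0 && idx < splits.length - 1, s"$value is outside the split range")
      idx
    }
  }

  // Hypothetical labels, ordered by index as a fitted StringIndexerModel
  // would return them.
  val labels: Array[String] = Array("a", "c", "b")

  // Name-to-index map: a label's position in the array is its index.
  val labelToIndex: Map[String, Int] = labels.zipWithIndex.toMap

  def main(args: Array[String]): Unit = {
    val hours = Seq(18.0, 19.0, 8.0, 5.0, 2.2)
    println(hours.map(bucketOf(splits, _)).mkString(", ")) // prints "2, 2, 1, 1, 0"
    println(labelToIndex("c")) // prints "1"
  }
}
```

The linear scan keeps the sketch short. For a large number of buckets, a binary search over the splits would be the more natural choice.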