
How to vectorize DataFrame columns for ML algorithms?


I have a DataFrame with some categorical string columns (e.g. uuid|url|browser).

I would like to convert them to doubles in order to execute an ML algorithm that accepts a double matrix.

As a conversion method I used StringIndexer (Spark 1.4), which maps my string values to double values, so I defined a function like this:

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

def str(arg: String, df: DataFrame): DataFrame = {
  val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg + "_index")
  indexer.fit(df).transform(df)
}
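
For a single column this does what I want; a minimal sketch of the usage, assuming a toy DataFrame (the data here is made up):

import sqlContext.implicits._

val df = Seq(
  ("a1", "http://example.com", "chrome"),
  ("a2", "http://example.org", "firefox")
).toDF("uuid", "url", "browser")

// Adds a Double column browser_index next to the original string columns
val indexed = str("browser", df)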

Now the issue is that I would like to iterate over each column of a df, call this function, and add (or convert) the parsed double column next to the original string column, so the result would be:

Initial df:

[String: uuid|String: url| String: browser]

Final df:

[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: browser_index]

Thanks in advance


Solution

  • You can simply foldLeft over the Array of columns:

    val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))
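
    A quick sanity check of the result: foldLeft threads each returned DataFrame into the next call, and the new columns are appended at the end rather than interleaved as sketched in the question, which should not matter for ML purposes:

    transformed.columns
    // Array(uuid, url, browser, uuid_index, url_index, browser_index)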
    

    Still, I will argue that it is not a good approach. Since str discards the StringIndexerModel, it cannot be used when you get new data. Because of that I would recommend using a Pipeline:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.StringIndexer
    
    val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
       cname => new StringIndexer()
         .setInputCol(cname)
         .setOutputCol(s"${cname}_index")
    )
    
    // Add the rest of your pipeline like VectorAssembler and algorithm
    val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???
    
    val pipeline = new Pipeline().setStages(stages)
    val model = pipeline.fit(df)
    model.transform(df)
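
    The fitted PipelineModel keeps every StringIndexerModel, so the same mappings can be applied to data that arrives later; a minimal sketch (newDF is a hypothetical DataFrame with the same schema):

    // Reuses the labels learned during fit instead of re-indexing from scratch
    val indexedNew = model.transform(newDF)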
    

    VectorAssembler can be included like this:

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
        .setInputCols(df.columns.map(cname => s"${cname}_index"))
        .setOutputCol("features")
    
    val stages = transformers :+ assembler
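
    Putting the pieces together with a final estimator fills in the ??? above; LogisticRegression is just an illustration here, and it assumes df also contains a Double label column:

    import org.apache.spark.ml.classification.LogisticRegression

    // Hypothetical final stage; any estimator consuming a "features" column works
    val lr = new LogisticRegression().setFeaturesCol("features").setLabelCol("label")
    val pipeline = new Pipeline().setStages(transformers :+ assembler :+ lr)
    val model = pipeline.fit(df)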
    

    You could also use RFormula, which is less customizable, but much more concise:

    import org.apache.spark.ml.feature.RFormula
    
    val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
    val rfModel = rf.fit(df)
    rfModel.transform(df)
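
    Note that RFormula string-indexes and then one-hot encodes the string columns under the hood, so the output is a single features vector column rather than separate *_index doubles, and the -1 term drops the intercept. A quick way to inspect it ("features" is the RFormula default output column):

    rfModel.transform(df).select("features").show()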