Search code examples
scalaapache-sparkapache-spark-mllibapache-spark-mlapache-spark-dataset

Using MLUtils.convertVectorColumnsToML() inside a UDF?


I have a Dataset/Dataframe with a mllib.linalg.Vector (of Doubles) as one of the columns. I would like to add another column to this dataset of type ml.linalg.Vector to this data set (so I will have both types of Vectors). The reason is I am evaluating few algorithms and some of those expect mllib vector and some expect ml vector. Also, I have to feed o/p of one algorithm to another and each use different types.

Can someone please help me convert mllib.linalg.Vector to ml.linalg.Vector and append a new column to the data set in hand. I tried using MLUtils.convertVectorColumnsToML() inside an UDF and regular functions but not able to get it to working. I am trying to avoid creating a new dataset and then doing inner join and dropping the columns as the data set will be huge eventually and joins are expensive.


Solution

  • You can use the method toML to convert from mllib to ml vector. An UDF and usage example can look like this:

    val convertToML = udf((mllibVec: org.apache.spark.mllib.linalg.Vector) = > {
      mllibVec.asML
    })
    
    val df2 = df.withColumn("mlVector", convertToML($"mllibVector"))
    

    Assuming df to be the original dataframe and the column with the mllib vector to be named mllibVector.