scala apache-spark apache-spark-sql apache-spark-mllib apache-spark-2.0

Spark 2.2: Load org.apache.spark.ml.feature.LabeledPoint from file

The following line of code loads the (soon to be deprecated) mllib.regression.LabeledPoint from file to an RDD[LabeledPoint]:

MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)

I'm unable to find the equivalent function for ml.feature.LabeledPoint, which is not yet heavily used in the Spark documentation examples.

Can someone point me to the relevant function?

Solution

With the ml package you won't need to put the data into a LabeledPoint since you can specify which columns to use for labels/features in all transformations/algorithms. For example:

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

To load the LibSVM file as a dataframe, simply do:

val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")

Which will return a dataframe with two columns:

The loaded DataFrame has two columns: label containing labels stored as doubles and features containing feature vectors stored as Vectors.

See the documentation for more information.