Spark 2.2: Load org.apache.spark.ml.feature.LabeledPoint from file


The following line loads (soon-to-be-deprecated) mllib.regression.LabeledPoint records from a file into an RDD[LabeledPoint]:

MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)

I'm unable to find the equivalent function for ml.feature.LabeledPoint, which is not yet heavily used in the Spark documentation examples.

Can someone point me to the relevant function?


Solution

  • With the ml package you don't need to wrap the data in a LabeledPoint, because every transformer and algorithm lets you specify which DataFrame columns hold the labels and the features. For example:

    val gbt = new GBTClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
    

    To load the LibSVM file as a dataframe, simply do:

    val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")
    

    This returns a DataFrame with two columns: label, containing the labels stored as doubles, and features, containing the feature vectors stored as Vectors.

    See the documentation for more information.
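
If you still need actual ml.feature.LabeledPoint objects (for instance, to pass to code that expects an RDD of them), one option is to map over the rows of the loaded DataFrame. The following is a minimal sketch, not an official loader: the helper name toLabeledPoints, the temporary sample file, and the local[*] master are assumptions made for illustration, and it relies on the label and features column names described above.

```scala
import java.nio.file.Files

import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Reuses the active session if there is one (e.g. in spark-shell), otherwise starts a local one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("libsvm-to-labeledpoint")
  .getOrCreate()

// Hypothetical sample data: two rows in LibSVM format (feature indices are 1-based).
val sampleFile = Files.createTempFile("sample", ".libsvm")
Files.write(sampleFile, "1.0 1:0.5 3:2.0\n0.0 2:1.5\n".getBytes("UTF-8"))

// Load the file as a DataFrame with "label" and "features" columns.
val df = spark.read.format("libsvm").load(sampleFile.toString)

// Hypothetical helper: turn each (label, features) row into an ml.feature.LabeledPoint.
def toLabeledPoints(data: DataFrame): RDD[LabeledPoint] =
  data.select("label", "features").rdd.map {
    case Row(label: Double, features: Vector) => LabeledPoint(label, features)
  }

val points: RDD[LabeledPoint] = toLabeledPoints(df)
points.collect().foreach(println)
```

In most cases this conversion is unnecessary, since the ml estimators accept the DataFrame directly. Call spark.stop() once you are finished with the session.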