The following line of code loads the (soon to be deprecated) mllib.regression.LabeledPoint
from file to an RDD[LabeledPoint]
:
MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)
I'm unable to find the equivalent function for ml.feature.LabeledPoint
, which is not yet heavily used in the Spark documentation examples.
Can someone point me to the relevant function?
With the ml
package you won't need to put the data into a LabeledPoint
since you can specify which columns to use for labels/features in all transformations/algorithms. For example:
val gbt = new GBTClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
To load the LibSVM
file as a dataframe, simply do:
val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")
Which will return a dataframe with two columns:
The loaded DataFrame has two columns: label containing labels stored as doubles and features containing feature vectors stored as Vectors.
See the documentation for more information.