Spark 2.2: Load from file

The following line of code loads a LibSVM file into an RDD[LabeledPoint], using the (soon to be deprecated) mllib.regression.LabeledPoint:

MLUtils.loadLibSVMFile(spark.sparkContext, s"$path${File.separator}${fileName}_data_sparse").repartition(defaultPartitionSize)

I'm unable to find the equivalent function for ml.feature.LabeledPoint, which is not yet heavily used in the Spark documentation examples.

Can someone point me to the relevant function?


  • With the ml package you won't need to put the data into a LabeledPoint since you can specify which columns to use for labels/features in all transformations/algorithms. For example:

    val gbt = new GBTClassifier().setLabelCol("label").setFeaturesCol("features")

    To load the LibSVM file as a dataframe, simply do:

    val df = spark.read.format("libsvm").load(s"$path${File.separator}${fileName}_data_sparse")

    This returns a DataFrame with two columns. As the documentation puts it:

    The loaded DataFrame has two columns: label containing labels stored as doubles and features containing feature vectors stored as Vectors.

    See the documentation for more information.
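
If you still need ml.feature.LabeledPoint objects (for example, to feed existing RDD-based code), one possible approach is to load the file as a DataFrame and map each row yourself. The following is only a sketch under assumptions: the object name LibSvmLoading, the helper loadLabeledPoints, the local master setting, and the fallback input path are all made up for illustration and are not part of the Spark API.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical helper object; the names here are illustrative only.
object LibSvmLoading {

  // Load a LibSVM file through the "libsvm" data source and convert each
  // row into an ml.feature.LabeledPoint.
  def loadLabeledPoints(spark: SparkSession, path: String): RDD[LabeledPoint] = {
    // The DataFrame has a Double column "label" and a Vector column "features".
    val df = spark.read.format("libsvm").load(path)

    df.select("label", "features").rdd.map {
      case Row(label: Double, features: Vector) => LabeledPoint(label, features)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LibSvmLoading")
      .master("local[*]") // assumption: running locally for a quick check
      .getOrCreate()

    // Assumption: a placeholder path; pass the real file as the first argument.
    val inputPath = args.headOption.getOrElse("data/sample_libsvm_data.txt")

    val points = loadLabeledPoints(spark, inputPath)
    points.take(5).foreach(println)

    spark.stop()
  }
}
```

In most ml pipelines this conversion is unnecessary, since estimators such as GBTClassifier read the label and features columns from the DataFrame directly. If you want to keep the partitioning from the original snippet, calling repartition(defaultPartitionSize) on the result should work the same way as before.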