Tags: pyspark, apache-spark-mllib

HDFS Files as input to Spark MLlib


All the examples in the tutorial (http://spark.apache.org/docs/latest/mllib-ensembles.html) use files in LibSVM format as input to Spark MLlib:

from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')

But I have a file with millions of rows located on HDFS, and I want to pass it as input to Spark MLlib using PySpark without converting it into LibSVM format.

Can anyone please guide me how to do this?


Solution

  • Generally, when you feed input to an algorithm in MLlib, you create an RDD of a certain data type (say, LabeledPoint or a Vector). MLUtils.loadLibSVMFile simply converts your data into a LabeledPoint RDD for you.

    You can directly transform your data into whatever format the algorithm needs and then pass the resulting RDD as input to your MLlib algorithm, as in the sketch below.

    http://spark.apache.org/docs/latest/mllib-data-types.html
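    As a minimal sketch of that approach: this assumes a hypothetical comma-separated file on HDFS (the path hdfs:///user/me/data.csv is made up) with the label in the first column and numeric features in the rest; adjust the parsing to your actual layout.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    sc = SparkContext(appName="hdfs-to-mllib")

    def parse_line(line):
        # Assumed layout: "label,f1,f2,..." -> LabeledPoint(label, [f1, f2, ...])
        values = [float(x) for x in line.split(',')]
        return LabeledPoint(values[0], values[1:])

    # Read the raw file straight from HDFS; no LibSVM conversion needed.
    # The path below is a placeholder for your own file.
    data = sc.textFile('hdfs:///user/me/data.csv').map(parse_line)

    # Pass the resulting LabeledPoint RDD directly to an MLlib algorithm,
    # e.g. the random-forest classifier from the ensembles tutorial.
    model = RandomForest.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=3)

    The key point is that the parsed RDD of LabeledPoint objects is interchangeable with the one loadLibSVMFile would have produced, so any algorithm from the tutorial accepts it unchanged.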