Tags: pyspark, apache-spark-mllib

HDFS Files as input to Spark MLlib


All the examples in the tutorial (http://spark.apache.org/docs/latest/mllib-ensembles.html) use files in LibSVM format as input to Spark MLlib:

from pyspark.mllib.util import MLUtils

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')

But I have a file with millions of rows located on HDFS, and I want to pass it as input to Spark MLlib using PySpark without converting it into LibSVM format.

Can anyone please guide me how to do this?


Solution

  • Generally, when you feed input to an algorithm in MLlib, you create an RDD of a certain data type (say, LabeledPoint or a Vector). MLUtils.loadLibSVMFile simply converts your data into a LabeledPoint RDD for you.

    You can directly transform your data into whatever format the algorithm needs and then pass the resulting RDD as input to your MLlib algorithm, as in the sketch below.

    http://spark.apache.org/docs/latest/mllib-data-types.html
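    As a minimal sketch of that approach: this assumes a hypothetical comma-separated file on HDFS (the path hdfs:///user/me/data.csv is made up) with the label in the first column and numeric features in the rest; adjust the parsing to your actual layout.

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.tree import RandomForest

    sc = SparkContext(appName="hdfs-to-mllib")

    def parse_line(line):
        # Assumed layout: "label,f1,f2,..." -> LabeledPoint(label, [f1, f2, ...])
        values = [float(x) for x in line.split(',')]
        return LabeledPoint(values[0], values[1:])

    # Read the raw file straight from HDFS; no LibSVM conversion needed.
    # The path below is a placeholder for your own file.
    data = sc.textFile('hdfs:///user/me/data.csv').map(parse_line)

    # Pass the resulting LabeledPoint RDD directly to an MLlib algorithm,
    # e.g. the random-forest classifier from the ensembles tutorial.
    model = RandomForest.trainClassifier(data, numClasses=2,
                                         categoricalFeaturesInfo={},
                                         numTrees=3)

    The key point is that the parsed RDD of LabeledPoint objects is interchangeable with the one loadLibSVMFile would have produced, so any algorithm from the tutorial accepts it unchanged.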