Search code examples
apache-sparkpysparkapache-spark-sqlapache-spark-mllib

convert dataframe to libsvm format


I have a dataframe resulting from a sql query

df1 = sqlContext.sql("select * from table_test")

I need to convert this dataframe to libsvm format so that it can be provided as an input for

pyspark.ml.classification.LogisticRegression

I tried to do the following. However, this resulted in the following error as I'm using spark 1.5.2

df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and can't directly pip install it. So I downloaded the file, scp-ed it and then manually installed it. Everything seemed to work fine but I still get the following error

import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils

Question 1: Is my above approach to convert dataframe to libsvm format in the right direction. Question 2: If "yes" to question 1, how to get MLUtils working. If "no", what is the best way to convert dataframe to libsvm format


Solution

  • I would act like that (it's just an example with an arbitrary dataframe, I don't know how your df1 is done, focus is on data transformations):

    This is my way to convert dataframe to libsvm format:

    # ... your previous imports
    
    from pyspark.mllib.util import MLUtils
    from pyspark.mllib.regression import LabeledPoint
    
    # A DATAFRAME
    >>> df.show()
    +---+---+---+
    | _1| _2| _3|
    +---+---+---+
    |  1|  3|  6|  
    |  4|  5| 20|
    |  7|  8|  8|
    +---+---+---+
    
    # FROM DATAFRAME TO RDD
    >>> c = df.rdd # this command will convert your dataframe in a RDD
    >>> print (c.take(3))
    [Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]
    
    # FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
    >>> d = c.map(lambda line: LabeledPoint(line[0],[line[1:]])) # arbitrary mapping, it's just an example
    >>> print (d.take(3))
    [LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]
    
    # SAVE AS LIBSVM
    >>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")
    

    What you will see on the "/your/Path/nameFolder/part-0000*" files is:

    1.0 1:3.0 2:6.0

    4.0 1:5.0 2:20.0

    7.0 1:8.0 2:8.0

    See here for LabeledPoint docs