Tags: numpy, apache-spark, pyspark, apache-spark-sql, apache-spark-mllib

Creating Spark dataframe from numpy matrix


It's my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a logit model. I ran the tutorial successfully and would like to pass my own data into it.

I've tried this:

%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])

but I cannot get rid of:

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

I'm using the ML library for the vector, and the input is a double array, so what's the catch? It should be fine according to the documentation.

Many thanks.


Solution

  • You are mixing functionality from ML and MLlib, which are not necessarily compatible: LabeledPoint comes from the old RDD-based pyspark.mllib API and expects a pyspark.mllib.linalg vector, so passing it a pyspark.ml.linalg.DenseVector raises exactly this TypeError. You don't need a LabeledPoint at all when using spark-ml; a plain (label, vector) tuple is enough:

    sc.version
    # u'2.1.1'
    
    import numpy as np
    from pyspark.ml.linalg import Vectors
    
    df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
    dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)
    
    mydf = spark.createDataFrame(dff,schema=["label", "features"])
    
    mydf.show(5)
    # +-----+-------------+ 
    # |label|     features| 
    # +-----+-------------+ 
    # |    1|[0.0,0.0,0.0]| 
    # |    0|[0.0,1.0,1.0]| 
    # |    0|[0.0,1.0,0.0]| 
    # |    1|[0.0,0.0,1.0]| 
    # |    0|[0.0,1.0,0.0]|
    # +-----+-------------+
    
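    One caveat with the data construction itself: np.concatenate([...]).reshape(1000, -1) lays the four source arrays end to end before reshaping, so each row holds four consecutive values from the same source array rather than one label plus three features (which is why the sample output above shows all-0/1 "features"). np.column_stack gives the intended column-wise layout. A NumPy-only sketch of the fixed construction and the row-to-tuple split, with plain lists standing in for Vectors.dense so the result can be inspected without Spark:

    ```python
    import numpy as np

    np.random.seed(0)  # reproducible toy data

    # Column-wise assembly: one binary label column plus three feature columns.
    data = np.column_stack([
        np.random.randint(0, 2, size=1000),   # label
        np.random.randn(1000),                # feature 1
        3 * np.random.randn(1000) + 2,        # feature 2
        6 * np.random.randn(1000) - 2,        # feature 3
    ])

    # Same (label, features) split as above; lists stand in for Vectors.dense.
    rows = [(int(r[0]), r[1:].tolist()) for r in data]

    print(data.shape)  # (1000, 4)
    ```

    Now every row really is (label, feature1, feature2, feature3), and the labels stay 0/1 across all 1000 rows instead of only the first quarter.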

    PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]
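    Since the original goal was a logit model, the resulting DataFrame can be passed straight to pyspark.ml.classification.LogisticRegression, which reads the "label" and "features" columns by default. A minimal end-to-end sketch, assuming a local Spark installation (the local[1] master and appName are just for this toy run):

    ```python
    import numpy as np
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.master("local[1]").appName("logit-toy").getOrCreate()

    np.random.seed(42)  # reproducible toy data
    # Column-wise toy data: binary label plus three Gaussian features.
    data = np.column_stack([
        np.random.randint(0, 2, size=100),
        np.random.randn(100),
        3 * np.random.randn(100) + 2,
        6 * np.random.randn(100) - 2,
    ])
    rows = [(int(r[0]), Vectors.dense(r[1:])) for r in data]
    mydf = spark.createDataFrame(rows, schema=["label", "features"])

    # LogisticRegression uses the "label"/"features" columns by default.
    model = LogisticRegression(maxIter=10).fit(mydf)
    print(model.coefficients)  # one weight per feature column

    spark.stop()
    ```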