it is my first time with PySpark, (Spark 2), and I'm trying to create a toy dataframe for a Logit model. I ran successfully the tutorial and would like to pass my own data into it.
I've tried this:
%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(df,["label", "features"])
but I cannot get rid of :
TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector
I'm using the ML library for vector and the input is a double array, so what's the catch, please? It should be fine according to the documentation.
Many thanks.
You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint
when using spark-ml
:
sc.version
# u'2.1.1'
import numpy as np
from pyspark.ml.linalg import Vectors
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(dff,schema=["label", "features"])
mydf.show(5)
# +-----+-------------+
# |label| features|
# +-----+-------------+
# | 1|[0.0,0.0,0.0]|
# | 0|[0.0,1.0,1.0]|
# | 0|[0.0,1.0,0.0]|
# | 1|[0.0,0.0,1.0]|
# | 0|[0.0,1.0,0.0]|
# +-----+-------------+
PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]