I have a DataFrame (df_ml_nullable) that looks like this:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 0.0|[127.0,132.0,123....|
| 0.0|[67.0,67.0,67.0,6...|
| 0.0|[-29.0,-30.0,-28....|
| 4.0|[31.0,31.0,31.0,3...|
| 0.0|[39.0,40.0,42.0,4...|
+-----+--------------------+
Below is the schema of this DataFrame, from df_ml_nullable.printSchema():
root
|-- label: double (nullable = false)
|-- features: vector (nullable = false)
I try to run logistic regression like this:
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(maxIter=10, regParam=0.01)
(train_d,test_d)=df_ml_nullable.randomSplit([0.7, 0.3])
model1 = lr.fit(train_d)
When I try to run this I get this error (the angle brackets of the struct types were eaten by the formatting; both sides of the message actually print the same schema):
IllegalArgumentException: u'requirement failed: Column features must be of type struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt; but was actually struct&lt;type:tinyint,size:int,indices:array&lt;int&gt;,values:array&lt;double&gt;&gt;.'
Has anyone faced this issue?
The problem was with the import. Instead of importing the vectors from ml, I was importing them from mllib, so the features column carried the mllib vector type even though the training code used spark.ml. The correction below did the trick:
#from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.ml.linalg import Vectors, VectorUDT
@Vincent - Thanks for the hint.