
Error while running Logistic Regression in PySpark ML


I have a dataframe (df_ml_nullable) like this:

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|[127.0,132.0,123....|
|  0.0|[67.0,67.0,67.0,6...|
|  0.0|[-29.0,-30.0,-28....|
|  4.0|[31.0,31.0,31.0,3...|
|  0.0|[39.0,40.0,42.0,4...|
+-----+--------------------+

Below is the schema of this dataframe (df_ml_nullable.printSchema()):

root
 |-- label: double (nullable = false)
 |-- features: vector (nullable = false)

I try to run logistic regression like this:

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import LogisticRegression
    lr = LogisticRegression(maxIter=10, regParam=0.01)
    (train_d,test_d)=df_ml_nullable.randomSplit([0.7, 0.3])
    model1 = lr.fit(train_d)

When I try to run this I get the following error (the `struct<...>` type signatures were garbled in the original post; reconstructed here from Spark's schema-validation message):

    IllegalArgumentException: u'requirement failed: Column features must be of type
    struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was
    actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'

Has anyone faced this issue?


Solution

  • The problem was with the import: I was importing Vectors from mllib instead of ml. The correction below did the trick:

    # from pyspark.mllib.linalg import Vectors, VectorUDT  # wrong: RDD-based API
    from pyspark.ml.linalg import Vectors, VectorUDT       # correct: DataFrame-based API
    

    @ Vincent - Thanks for the hint.