Tags: apache-spark, machine-learning, pyspark, apache-spark-mllib, logistic-regression

'Input validation failed' error in Spark MLlib multi-class logistic regression


I have a training set of 5000 rows and 401 columns, where column 1 is the label and the remaining 400 columns are features. I am trying to do multiclass logistic regression using PySpark MLlib. Please find my code below. I must confess this is not optimized or well-written code, as I am still a newbie in the field of Python/PySpark.

import scipy.io as sio          # for loading the .mat training set
import numpy as np
import pandas as pd             # used further down to build the Spark DataFrame
from pyspark.sql import SparkSession

tset = sio.loadmat('ex3data1.mat')   # load the training set from a mat file
X = tset['X']                        # read the X, y values
y = tset['y']
print(X.shape)  # works!
print(y.shape)
sp = SparkSession.builder.master("local").appName("multiclassifier").getOrCreate()
sc = sp.sparkContext
XY = np.concatenate((y, X), axis=1)  # 5000x401, where the first col is the label
print(XY[0:2])

Sample output of the print above. Please note that I am only printing the first row:

[[  1.00000000e+01   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
8.56059680e-06   1.94035948e-06  -7.37438725e-04  -8.13403799e-03
-1.86104473e-02  -1.87412865e-02  -1.87572508e-02  -1.90963542e-03
-1.64039011e-02  -3.78191381e-03   3.30347316e-04   1.27655229e-05
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   1.16421569e-04
    1.20052179e-04  -1.40444581e-02  -2.84542484e-02   8.03826593e-02
    2.66540339e-01   2.73853746e-01   2.78729541e-01   2.74293607e-01
    2.24676403e-01   2.77562977e-02  -7.06315478e-03   2.34715414e-04
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   1.28335523e-17  -3.26286765e-04
   -1.38651604e-02   8.15651552e-02   3.82800381e-01   8.57849775e-01
    1.00109761e+00   9.69710638e-01   9.30928598e-01   1.00383757e+00
    9.64157356e-01   4.49256553e-01  -5.60408259e-03  -3.78319036e-03
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    5.10620915e-06   4.36410675e-04  -3.95509940e-03  -2.68537241e-02
    1.00755014e-01   6.42031710e-01   1.03136838e+00   8.50968614e-01
    5.43122379e-01   3.42599738e-01   2.68918777e-01   6.68374643e-01
    1.01256958e+00   9.03795598e-01   1.04481574e-01  -1.66424973e-02
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    2.59875260e-05  -3.10606987e-03   7.52456076e-03   1.77539831e-01
    7.92890120e-01   9.65626503e-01   4.63166079e-01   6.91720680e-02
   -3.64100526e-03  -4.12180405e-02  -5.01900656e-02   1.56102907e-01
    9.01762651e-01   1.04748346e+00   1.51055252e-01  -2.16044665e-02
    0.00000000e+00   0.00000000e+00   0.00000000e+00   5.87012352e-05
   -6.40931373e-04  -3.23305249e-02   2.78203465e-01   9.36720163e-01
    1.04320956e+00   5.98003217e-01  -3.59409041e-03  -2.16751770e-02
   -4.81021923e-03   6.16566793e-05  -1.23773318e-02   1.55477482e-01
    9.14867477e-01   9.20401348e-01   1.09173902e-01  -1.71058007e-02
    0.00000000e+00   0.00000000e+00   1.56250000e-04  -4.27724104e-04
   -2.51466503e-02   1.30532561e-01   7.81664862e-01   1.02836583e+00
    7.57137601e-01   2.84667194e-01   4.86865128e-03  -3.18688725e-03
    0.00000000e+00   8.36492601e-04  -3.70751123e-02   4.52644165e-01
    1.03180133e+00   5.39028101e-01  -2.43742611e-03  -4.80290033e-03
    0.00000000e+00   0.00000000e+00  -7.03635621e-04  -1.27262443e-02
    1.61706648e-01   7.79865383e-01   1.03676705e+00   8.04490400e-01
    1.60586724e-01  -1.38173339e-02   2.14879493e-03  -2.12622549e-04
    2.04248366e-04  -6.85907627e-03   4.31712963e-04   7.20680947e-01
    8.48136063e-01   1.51383408e-01  -2.28404366e-02   1.98971950e-04
    0.00000000e+00   0.00000000e+00  -9.40410539e-03   3.74520505e-02
    6.94389110e-01   1.02844844e+00   1.01648066e+00   8.80488426e-01
    3.92123945e-01  -1.74122413e-02  -1.20098039e-04   5.55215142e-05
   -2.23907271e-03  -2.76068376e-02   3.68645493e-01   9.36411169e-01
    4.59006723e-01  -4.24701797e-02   1.17356610e-03   1.88929739e-05
    0.00000000e+00   0.00000000e+00  -1.93511951e-02   1.29999794e-01
    9.79821705e-01   9.41862388e-01   7.75147704e-01   8.73632241e-01
    2.12778350e-01  -1.72353349e-02   0.00000000e+00   1.09937426e-03
   -2.61793751e-02   1.22872879e-01   8.30812662e-01   7.26501773e-01
    5.24441863e-02  -6.18971913e-03   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00  -9.36563862e-03   3.68349741e-02
    6.99079299e-01   1.00293583e+00   6.05704402e-01   3.27299224e-01
   -3.22099249e-02  -4.83053002e-02  -4.34069138e-02  -5.75151144e-02
    9.55674190e-02   7.26512627e-01   6.95366966e-01   1.47114481e-01
   -1.20048679e-02  -3.02798203e-04   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00  -6.76572712e-04  -6.51415556e-03
    1.17339359e-01   4.21948410e-01   9.93210937e-01   8.82013974e-01
    7.45758734e-01   7.23874268e-01   7.23341725e-01   7.20020340e-01
    8.45324959e-01   8.31859739e-01   6.88831870e-02  -2.77765012e-02
    3.59136710e-04   7.14869281e-05   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   1.53186275e-04   3.17353553e-04
   -2.29167177e-02  -4.14402914e-03   3.87038450e-01   5.04583435e-01
    7.74885876e-01   9.90037446e-01   1.00769478e+00   1.00851440e+00
    7.37905042e-01   2.15455291e-01  -2.69624864e-02   1.32506127e-03
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    2.36366422e-04  -2.26031454e-03  -2.51994485e-02  -3.73889910e-02
    6.62121228e-02   2.91134498e-01   3.23055726e-01   3.06260315e-01
    8.76070942e-02  -2.50581917e-02   2.37438725e-04   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   6.20939216e-18   6.72618320e-04
   -1.13151411e-02  -3.54641066e-02  -3.88214912e-02  -3.71077412e-02
   -1.33524928e-02   9.90964718e-04   4.89176960e-05   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00]]

End of print output.

pXYdf = pd.DataFrame(XY)           # numpy array -> pandas DataFrame
sXYdf = sp.createDataFrame(pXYdf)  # pandas DataFrame -> Spark DataFrame

from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
import pyspark.mllib.regression as reg

trainingData = sXYdf.rdd.map(lambda x: reg.LabeledPoint(x[0], x[1:]))  # column 0 is the label, the rest are features
trainingData.take(2)  # this works!!

Output of one record in LabeledPoint format (I am unable to format it properly here, as there are 400 features):

[LabeledPoint(10.0,[0.0,0.0,0.0,0.0,0.0,0.0,....,8.56059679589e-06, 1.94035947712e-06,.........]),

lrm=LogisticRegressionWithLBFGS.train(trainingData,numClasses=10)

However, this does not seem to work, as I am getting a Py4JJavaError. Unfortunately, I am having a local PySpark execution issue which I am trying to debug, so I could not copy and paste the error message.


Solution

  • My guess is that you got the error

    : org.apache.spark.SparkException: Input validation failed.
    

There is a practically undocumented requirement in Spark MLlib logistic regression, according to which for k classes the labels should be 0, 1, ..., k-1, i.e. they must start from zero. No matter how long you stare at the relevant PySpark documentation, you will not find this requirement, simply because it is not there; to see it, you must dig into the relevant Scala source code, where it is mentioned:

    * @note Labels used in Logistic Regression should be {0, 1, ..., k - 1}
    

    Judging from the code snippets you have provided, it seems that you have 10 classes, labeled as 1, 2, ..., 10; this will not work, and it will produce the above error. Here is a short demonstration with dummy data and 3 classes:

    sc.version
    # u'2.1.1'
    
    from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
    from pyspark.mllib.regression import LabeledPoint
    
    data = [LabeledPoint(3.0, [4.6,3.6,1.0,0.2]), # 3 classes, labeled 1, 2, 3
            LabeledPoint(3.0, [5.7,4.4,1.5,0.4]),
            LabeledPoint(1.0, [6.7,3.1,4.4,1.4]),
            LabeledPoint(2.0, [4.8,3.4,1.6,0.2]),
            LabeledPoint(1.0, [4.4,3.2,1.3,0.2])]
    
    model = LogisticRegressionWithLBFGS.train(sc.parallelize(data), numClasses=3)
    [...]
    Py4JJavaError: An error occurred while calling o173.trainLogisticRegressionModelWithLBFGS. 
    : org.apache.spark.SparkException: Input validation failed.
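
You can quickly confirm that this is what is happening with your own data by inspecting the distinct label values of your RDD before training. A minimal check, assuming the trainingData RDD of LabeledPoint built in your question:

# collect the distinct label values of the training RDD
labels = sorted(trainingData.map(lambda lp: lp.label).distinct().collect())
print(labels)
# something like [1.0, 2.0, ..., 10.0] means the labels start from 1, not 0,
# which is exactly what triggers the input validation error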
    

So, if this is indeed your case, you need to subtract 1 from your labels, thus converting them from 1, 2, ..., 10 to 0, 1, ..., 9. Here is a quick way to do so with my 3-class dummy data:

    new_data = sc.parallelize(data).map(lambda x: LabeledPoint(x.label-1, x.features))
    new_data.collect() # for demonstration purposes only
    # [LabeledPoint(2.0, [4.6,3.6,1.0,0.2]), # 3 classes, labeled 0, 1, 2
    #  LabeledPoint(2.0, [5.7,4.4,1.5,0.4]),
    #  LabeledPoint(0.0, [6.7,3.1,4.4,1.4]),
    #  LabeledPoint(1.0, [4.8,3.4,1.6,0.2]),
    #  LabeledPoint(0.0, [4.4,3.2,1.3,0.2])]
    
    model = LogisticRegressionWithLBFGS.train(new_data, numClasses=3) # works OK
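
Applied to your own pipeline, the same fix would look roughly like the sketch below. This is only a sketch, assuming the trainingData RDD and the reg alias from your question (the trainingData_0 name is just for illustration):

# shift the labels from 1..10 down to 0..9; the features stay untouched
trainingData_0 = trainingData.map(
    lambda lp: reg.LabeledPoint(lp.label - 1, lp.features))

lrm = LogisticRegressionWithLBFGS.train(trainingData_0, numClasses=10)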
    

I have argued at length about this Input validation failed error and other undocumented and/or counter-intuitive behavior in Spark 2 in a blog post, which you may find useful.
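
One final practical note: a model trained on the shifted labels also predicts in the 0, 1, ..., 9 space, so you simply add 1 to a prediction if you want to report it in your original 1, 2, ..., 10 labeling. A rough sketch of scoring the training set, again assuming the trainingData_0 and lrm names from the sketch above:

# predictions come back in the 0..9 space used for training
preds_and_labels = trainingData_0.map(
    lambda lp: (float(lrm.predict(lp.features)), lp.label))
train_err = (preds_and_labels.filter(lambda pl: pl[0] != pl[1]).count()
             / float(trainingData_0.count()))
print("training error = %g" % train_err)

# map a single prediction back to the original 1..10 labels
first_point = trainingData_0.first()
original_label = lrm.predict(first_point.features) + 1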