machine-learning · scikit-learn · pyspark · cross-validation · apache-spark-ml

Can I use a dataframe with sparse vectors to do cross-validation tuning?


I'm training a multilayer perceptron classifier. Here's my training set; the features are in sparse vector format.

df_train.show(10,False)
+------+---------------------------+
|target|features                   |
+------+---------------------------+
|1.0   |(5,[0,1],[164.0,520.0])    |
|1.0   |[519.0,2723.0,0.0,3.0,4.0] |
|1.0   |(5,[0,1],[2868.0,928.0])   |
|0.0   |(5,[0,1],[57.0,2715.0])    |
|1.0   |[1241.0,2104.0,0.0,0.0,2.0]|
|1.0   |[3365.0,217.0,0.0,0.0,2.0] |
|1.0   |[60.0,1528.0,4.0,8.0,7.0]  |
|1.0   |[396.0,3810.0,0.0,0.0,2.0] |
|1.0   |(5,[0,1],[905.0,2476.0])   |
|1.0   |(5,[0,1],[905.0,1246.0])   |
+------+---------------------------+

First of all, I want to evaluate my estimator with a hold-out method. Here's my code:

from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

layers = [4, 5, 4, 3]
trainer = MultilayerPerceptronClassifier(maxIter=100, layers=layers, blockSize=128, seed=1234)
param = trainer.setParams(featuresCol="features", labelCol="target")

train,test = df_train.randomSplit([0.8, 0.2])
model = trainer.fit(train)
result = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="target", predictionCol="prediction", metricName="accuracy")
print("Test set accuracy = " + str(evaluator.evaluate(result)))

But it throws this error: Failed to execute user defined function($anonfun$1: (vector) => double). Is this because I have sparse vectors in my features? What can I do?

And for the cross-validation part, I coded the following:

X = df_train.select("features").collect()
y = df_train.select("target").collect()
from sklearn.model_selection import cross_val_score, KFold
k_fold = KFold(n_splits=10, random_state=None, shuffle=False)
print(cross_val_score(trainer, X, y, cv=k_fold, n_jobs=1, scoring="accuracy"))

And I get: it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method. But when I looked it up in the documentation, I couldn't find a get_params method. Can someone help me with this?


Solution

  • There are a number of issues with your question...

    Focusing on the second part (it is actually a separate question), the error message's claim, i.e. that

    it does not seem to be a scikit-learn estimator

    is indeed correct: you are passing the MultilayerPerceptronClassifier from PySpark ML as trainer to the scikit-learn function cross_val_score, and the two libraries are not compatible.

    Additionally, your 2nd code snippet is not PySpark-like at all, but scikit-learn-like: while your 1st snippet uses the input correctly (a single 2-column dataframe, with the features in one column and the labels/targets in the other), you seem to have forgotten this lesson in your second snippet, where you build separate dataframes X and y as input to your classifier (as one would in scikit-learn, but not in PySpark). See the CrossValidator docs for a straightforward example of correct usage.

    From a more general viewpoint: if your data fit in main memory (i.e. you can collect them, as you do for your CV), there is absolutely no reason to bother with Spark ML, and you would be far better off with scikit-learn.
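    In that case the workflow could look roughly like the sketch below; note that scikit-learn's MLPClassifier is a substitute model, not the same implementation as Spark's MLP, and the random arrays here stand in for the result of df_train.toPandas():

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier

# Stand-in for collected Spark data. In practice you would do something like:
#   pdf = df_train.toPandas()
#   X = np.array([v.toArray() for v in pdf["features"]])
#   y = pdf["target"].values
rng = np.random.default_rng(0)
X = rng.random((40, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(float)

# sklearn's MLPClassifier as a substitute (hidden layers of 5 and 4 units,
# mirroring the Spark layers=[5, 5, 4, 2] spec; input/output sizes are inferred)
clf = MLPClassifier(hidden_layer_sizes=(5, 4), max_iter=500, random_state=0)

k_fold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, y, cv=k_fold, scoring="accuracy")
print(scores.mean())
```

    Unlike the PySpark estimator, this classifier implements get_params, so cross_val_score accepts it without complaint.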

    --

    Regarding the 1st part: the data you have shown seem to have only 2 labels, 0.0 and 1.0; I cannot be sure (since you show only 10 records), but if you indeed have only 2 labels you should not use MulticlassClassificationEvaluator but BinaryClassificationEvaluator - which, however, does not offer a metricName="accuracy" option... [EDIT: against all odds, it seems that MulticlassClassificationEvaluator can indeed work for binary classification too, and it is a handy way to get the accuracy, which its binary counterpart does not provide!]

    But this is not why you get the error (which, BTW, has nothing to do with the evaluator - you get it with result.show() or result.collect() as well); the reason for the error is that the number of nodes in your first layer (layers[0]) is 4, while your input vectors are evidently 5-dimensional. From the docs:

    Number of inputs has to be equal to the size of feature vectors

    Changing layers[0] to 5 resolves the issue (not shown). Similarly, if you indeed have only 2 classes, you should also change layers[-1] to 2 (you will not get an error if you don't, but it won't make much sense from a classification point of view).