PySpark makes it possible to parallelize cross-validation of models via https://github.com/databricks/spark-sklearn
as a simple drop-in replacement for sklearn's GridSearchCV
with
from spark_sklearn import GridSearchCV
How can I achieve similar functionality for Spark's Scala CrossValidator
i.e. to parallelize each fold?
Since Spark 2.3:
You can do that using the setParallelism(n) method of CrossValidator, or by setting the parallelism parameter on creation, i.e.:
cv.setParallelism(2)
or
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,
                    parallelism=2)  # Evaluate up to 2 parameter settings in parallel
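Since the question is about the Scala API, here is a minimal sketch of the same thing in Scala. It assumes a LogisticRegression estimator and a DataFrame with the usual "features"/"label" columns; the estimator, grid values, and fold count are illustrative, not prescriptive.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// A small hypothetical parameter grid to search over.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)
  .setParallelism(2) // evaluate up to 2 models in parallel (Spark >= 2.3)

// val cvModel = cv.fit(trainingData) // trainingData: DataFrame with "features" and "label"
```

Note that parallelism here controls how many models are evaluated concurrently on the driver; each individual fit is still distributed by Spark as usual.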
Before Spark 2.3:
You can't do that in Spark Scala: there is no way to parallelize Cross Validation.
If you read the documentation of spark-sklearn
carefully, GridSearchCV is parallelized but the model training isn't, so this is of limited use at scale. Furthermore, you can't parallelize Cross Validation yourself in the Spark Scala API because of the famous SPARK-5063:
RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
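For illustration, the pattern SPARK-5063 forbids looks like this (the RDDs and their contents are hypothetical; assumes a SparkContext `sc`):

```scala
// Two hypothetical RDDs built from the driver's SparkContext.
val rdd1 = sc.parallelize(1 to 10)
val rdd2 = sc.parallelize(Seq(("a", 1), ("b", 2)))

// INVALID: rdd2 is referenced inside a transformation on rdd1.
// This fails with a SparkException, because RDD actions like count()
// can only be invoked by the driver, not inside executors' tasks.
// val broken = rdd1.map(x => rdd2.values.count() * x)

// Valid alternative: run the action on the driver first,
// then close over the plain result value.
val n = rdd1.count()
val ok = rdd2.mapValues(v => v * n)
```

This is exactly why a naive attempt to launch the per-fold fits from inside another Spark job cannot work.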
Excerpt from the README.md:

This package contains some tools to integrate the Spark computing framework with the popular scikit-learn machine learning library. Among other tools, it lets you:

- train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn.
- convert Spark's DataFrames seamlessly into numpy ndarrays or sparse matrices.
- (experimental) distribute Scipy's sparse matrices as a dataset of sparse vectors.

It focuses on problems that have a small amount of data and that can be run in parallel. For small datasets, it distributes the search for estimator parameters (GridSearchCV in scikit-learn), using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in Spark MLlib.

NOTE: This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).