Tags: scala, apache-spark, apache-spark-mllib, apache-spark-ml

Parallel training of independent models in SparkML (Scala)


Suppose I have 3 simple SparkML models that will use the same DataFrame as input, but are completely independent of each other (in both running sequence and columns of data being used).

The first thing that comes to mind is to just create a single Pipeline with the 3 models in its stages array, and run the overarching fit/transform to get the full set of predictions.

BUT, my understanding is that because we're stacking these models in a single pipeline as a sequence, Spark will not necessarily run these models in parallel, even though they are completely independent of each other.
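For concreteness, that single-pipeline version would look something like this (just a sketch; the `LinearRegression` estimators, column names and `df` are placeholders, not my actual code):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.LinearRegression

// Three independent estimators stacked into one Pipeline.
// Estimators and column names are placeholders.
val modelA = new LinearRegression().setFeaturesCol("featuresA").setLabelCol("labelA").setPredictionCol("predA")
val modelB = new LinearRegression().setFeaturesCol("featuresB").setLabelCol("labelB").setPredictionCol("predB")
val modelC = new LinearRegression().setFeaturesCol("featuresC").setLabelCol("labelC").setPredictionCol("predC")

// Pipeline.fit() walks the stages sequentially: each estimator is fitted,
// then its transform output is passed on to the next stage.
val pipeline = new Pipeline().setStages(Array(modelA, modelB, modelC))
val fitted = pipeline.fit(df)        // df: the shared input DataFrame
val scored = fitted.transform(df)
```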

That being said, is there a way to fit/transform 3 independent models in parallel? The first thing I thought of was to create a function/object that builds a pipeline, and then run a map or par map where I fit the 3 models inside the map function, but I don't know if that will actually take advantage of the parallelism.

These are not really cross validation type models either; the workflow I'd like is:

  1. Prep my dataframe
  2. The dataframe will have let's say 10 columns of 0-1s
  3. I will run a total of 10 models, where each model takes one of the 10 columns, filters the data to rows where that column's value == 1, and then runs fit/transform.

Hence, the independence comes from the fact that these individual models are not chained and can be run as-is.
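Roughly, what I had in mind for the map/parmap version is something like this (the column names, the buildPipeline helper and the LinearRegression stage are just placeholders):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// One model per 0-1 flag column, each trained only on the rows where
// that flag == 1. buildPipeline stands in for whatever stages are needed.
val flagCols = (1 to 10).map(i => s"flag_$i")

def buildPipeline(): Pipeline =
  new Pipeline().setStages(Array(new LinearRegression()))

def fitForColumn(df: DataFrame, flagCol: String): PipelineModel =
  buildPipeline().fit(df.filter(col(flagCol) === 1))

// .par submits the fit() calls from separate driver threads, but every fit
// still runs as ordinary Spark jobs sharing the same executors.
val models: Map[String, PipelineModel] =
  flagCols.par.map(c => c -> fitForColumn(df, c)).seq.toMap
```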

Thanks!


Solution

  • SparkML supports parallel evaluation for the same pipeline (https://spark.apache.org/docs/2.3.0/ml-tuning.html), but I haven't seen any implementation of this for different models yet. If you use a parallel collection to wrap your pipelines, the first model that is fitted gets the resources of your Spark app. Maybe you could do something with the RDD API, but with Spark ML, training different pipelines in parallel, each spawning its own parallel stages with a different pipeline model, is not possible at the moment.
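A minimal sketch of that built-in parallelism (Spark 2.3+): it parallelises the fits for different parameter combinations of the same pipeline, not for different pipelines. The estimator, grid and evaluator below are placeholders:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// CrossValidator (and TrainValidationSplit) can fit several models
// concurrently, but only variations of one and the same pipeline.
val lr = new LinearRegression()
val pipeline = new Pipeline().setStages(Array(lr))

val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setParallelism(4)   // up to 4 fits in parallel, all for this one pipeline

val cvModel = cv.fit(df)
```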