python apache-spark spark-structured-streaming apache-spark-ml

(Py-)Spark train model with spark structured-streaming


I'm using Spark 3.x and can't figure out how to train a model, for example a Random Forest classifier, using Spark Structured Streaming (not the older Spark Streaming).

I've set up the stream that delivers the micro-batches for training, and I have the spark.ml pipeline ready, but I'm missing a function along the lines of partial fit.

Since Spark is built for big data and distributed ML, I'd expect a method like this to exist.

The code for training would look something like this:

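from pyspark.ml import Pipeline

# Split the static DataFrame 70/30, then fit the whole pipeline in one pass
training_data, test_data = data.randomSplit([0.7, 0.3])
pipeline = Pipeline(stages=[featureIndexerA, assembler, rf, labelConverter])
model = pipeline.fit(training_data)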

How can this be used with multiple micro-batches?


Solution

  • As it turns out, no: spark.ml has no native implementation that can train a random forest piece by piece.

    If you have a huge data set and can't feed it in all at once, you could use sklearn, where you can train two or more models on different parts of the data and combine them afterwards. But this simply adds up all the trees, which makes the model grow very large (for example, 3 forests with 20 trees each result in a single 60-tree random forest).

    You can do this either manually, by adding the trees to the estimators_ list, or with the built-in warm_start option.

    In terms of accuracy it looks quite promising: it performs about the same as, and sometimes even better than, training all at once. However, I only compared a 40-estimator forest trained on the whole data set against two 20-estimator forests, each trained on half of the data set.

    If a random forest is not what you need, there are some algorithms that work with streams, but only with Spark Streaming (the RDD-based API), not Spark Structured Streaming (the DataFrame-based one). These are marked in the docs.

    There are also some research papers that implement random forests on Spark Structured Streaming, but I haven't tried any of them because it looks pretty time-consuming.
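
For reference, the sklearn workaround described in this answer could look roughly like the sketch below. It is only an illustration, not the code used for the comparison mentioned above: the synthetic data from make_classification, the two-way split, and all variable names are assumptions. It shows both variants, warm_start and merging the estimators_ lists by hand.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a data set that arrives in two chunks (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_a, y_a = X[:500], y[:500]
X_b, y_b = X[500:], y[500:]

# Variant 1: warm_start. A later fit() keeps the trees that already exist
# and only trains the additionally requested ones on the data it is given.
warm = RandomForestClassifier(n_estimators=20, warm_start=True, random_state=0)
warm.fit(X_a, y_a)        # first 20 trees are trained on chunk A only
warm.n_estimators += 20   # request 20 more trees
warm.fit(X_b, y_b)        # the 20 new trees are trained on chunk B only

# Variant 2: train independent forests, then merge their tree lists by hand.
rf_a = RandomForestClassifier(n_estimators=20, random_state=1).fit(X_a, y_a)
rf_b = RandomForestClassifier(n_estimators=20, random_state=2).fit(X_b, y_b)
merged = rf_a                            # reuse rf_a as the container
merged.estimators_ += rf_b.estimators_   # append rf_b's trees
merged.n_estimators = len(merged.estimators_)

print(len(warm.estimators_), len(merged.estimators_))
```

Both variants assume that every chunk contains all class labels; otherwise the trees disagree about the set of classes and the combined forest is not reliable. As noted above, the forest grows with every chunk, so this does not scale to an unbounded stream.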