Tags: apache-spark, apache-spark-sql, apache-spark-ml

Spark Structured Streaming and Spark-Ml Regression


Is it possible to apply Spark-Ml regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with structured streaming sources.

  1. How am I supposed to apply regressions on structured streaming sources?
  2. (A little OT) If I cannot use the streaming API for regression, how can I commit offsets or the like to the source in a batch-processing way? (Kafka sink)

Solution

  • As of today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.

    You can however:

    • Train iterative, non-distributed models using the foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this:

      • Fetch the latest model when ForeachWriter.open is called and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
      • Compute the loss for each record in ForeachWriter.process and update the accumulator.
      • Push the losses to the external store when ForeachWriter.close is called.
      • This leaves the external storage in charge of computing the gradient and updating the model, with the implementation dependent on the store.
    • Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau)
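The open/process/close lifecycle above can be sketched in plain Python, without a Spark dependency, by mimicking the ForeachWriter contract. Everything here is illustrative: `ExternalModelStore` is a hypothetical in-memory stand-in for whatever external storage you choose (Redis, a database, ...), and the store-side update is a plain SGD step on a squared-error gradient for linear regression — your loss and update rule would differ.

```python
class ExternalModelStore:
    """Hypothetical external store: holds the model and applies the update.

    In a real deployment this would be Redis, a database, etc.; the store
    (not the writer) is in charge of the gradient step, per the last bullet.
    """

    def __init__(self, n_features, lr=0.5):
        self.weights = [0.0] * n_features
        self.lr = lr  # learning rate for the SGD step (illustrative choice)

    def fetch_model(self):
        # Return a copy so each partition works on its own snapshot.
        return list(self.weights)

    def push_gradients(self, grad_sum, count):
        # Apply one SGD step on the averaged per-partition gradient.
        if count == 0:
            return
        self.weights = [w - self.lr * g / count
                        for w, g in zip(self.weights, grad_sum)]


class RegressionForeachWriter:
    """Mimics Structured Streaming's ForeachWriter contract for one partition."""

    def __init__(self, store):
        self.store = store

    def open(self, partition_id, epoch_id):
        # Step 1: fetch the latest model and reset the local accumulator
        # (an ordinary variable, not a Spark accumulator).
        self.weights = self.store.fetch_model()
        self.grad_sum = [0.0] * len(self.weights)
        self.count = 0
        return True  # accept this partition

    def process(self, row):
        # Step 2: per-record squared-error gradient for linear regression.
        features, label = row
        pred = sum(w * x for w, x in zip(self.weights, features))
        err = pred - label
        self.grad_sum = [g + err * x for g, x in zip(self.grad_sum, features)]
        self.count += 1

    def close(self, error):
        # Step 3: push the accumulated gradients; the store does the update.
        if error is None:
            self.store.push_gradients(self.grad_sum, self.count)


# Simulated micro-batches for the target y = 2x; the learned weight
# should approach 2.0 as batches arrive.
store = ExternalModelStore(n_features=1)
writer = RegressionForeachWriter(store)
for epoch in range(50):
    writer.open(partition_id=0, epoch_id=epoch)
    for row in [([1.0], 2.0), ([2.0], 4.0)]:
        writer.process(row)
    writer.close(None)
```

With Spark, the same class shape plugs into `df.writeStream.foreach(writer)` (the Python foreach sink requires Spark 2.4+; on 2.2/2.3 you would write the equivalent `ForeachWriter` in Scala or Java).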