Tags: apache-spark, apache-spark-sql, apache-spark-ml

Spark Structured Streaming and Spark-Ml Regression


Is it possible to apply Spark-Ml regression to streaming sources? I see there is StreamingLogisticRegressionWithSGD, but it's for the older RDD API and I couldn't use it with structured streaming sources.

  1. How am I supposed to apply regressions on structured streaming sources?
  2. (A little OT) If I cannot use the streaming API for regression, how can I commit offsets or the like to the source in a batch-processing way? (Kafka sink)

Solution

  • As of today (Spark 2.2 / 2.3) there is no support for machine learning in Structured Streaming, and there is no ongoing work in this direction. Please follow SPARK-16424 to track future progress.

    You can however:

    • Train iterative, non-distributed models using the foreach sink and some form of external state storage. At a high level, a regression model could be implemented like this:

      • Fetch the latest model when ForeachWriter.open is called and initialize a loss accumulator (not in the Spark sense, just a local variable) for the partition.
      • Compute the loss for each record in ForeachWriter.process and update the accumulator.
      • Push the losses to the external store when ForeachWriter.close is called.
      • This leaves the external storage in charge of computing the gradient and updating the model, with the implementation dependent on the store.
    • Try to hack SQL queries (see https://github.com/holdenk/spark-structured-streaming-ml by Holden Karau)
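The open/process/close lifecycle above can be sketched in plain Python, without a Spark dependency, by mimicking the ForeachWriter contract. Everything here is illustrative: `ExternalModelStore` is a hypothetical in-memory stand-in for whatever external storage you choose (Redis, a database, ...), and the store-side update is a plain SGD step on a squared-error gradient for linear regression — your loss and update rule would differ.

```python
class ExternalModelStore:
    """Hypothetical external store: holds the model and applies the update.

    In a real deployment this would be Redis, a database, etc.; the store
    (not the writer) is in charge of the gradient step, per the last bullet.
    """

    def __init__(self, n_features, lr=0.5):
        self.weights = [0.0] * n_features
        self.lr = lr  # learning rate for the SGD step (illustrative choice)

    def fetch_model(self):
        # Return a copy so each partition works on its own snapshot.
        return list(self.weights)

    def push_gradients(self, grad_sum, count):
        # Apply one SGD step on the averaged per-partition gradient.
        if count == 0:
            return
        self.weights = [w - self.lr * g / count
                        for w, g in zip(self.weights, grad_sum)]


class RegressionForeachWriter:
    """Mimics Structured Streaming's ForeachWriter contract for one partition."""

    def __init__(self, store):
        self.store = store

    def open(self, partition_id, epoch_id):
        # Step 1: fetch the latest model and reset the local accumulator
        # (an ordinary variable, not a Spark accumulator).
        self.weights = self.store.fetch_model()
        self.grad_sum = [0.0] * len(self.weights)
        self.count = 0
        return True  # accept this partition

    def process(self, row):
        # Step 2: per-record squared-error gradient for linear regression.
        features, label = row
        pred = sum(w * x for w, x in zip(self.weights, features))
        err = pred - label
        self.grad_sum = [g + err * x for g, x in zip(self.grad_sum, features)]
        self.count += 1

    def close(self, error):
        # Step 3: push the accumulated gradients; the store does the update.
        if error is None:
            self.store.push_gradients(self.grad_sum, self.count)


# Simulated micro-batches for the target y = 2x; the learned weight
# should approach 2.0 as batches arrive.
store = ExternalModelStore(n_features=1)
writer = RegressionForeachWriter(store)
for epoch in range(50):
    writer.open(partition_id=0, epoch_id=epoch)
    for row in [([1.0], 2.0), ([2.0], 4.0)]:
        writer.process(row)
    writer.close(None)
```

With Spark, the same class shape plugs into `df.writeStream.foreach(writer)` (the Python foreach sink requires Spark 2.4+; on 2.2/2.3 you would write the equivalent `ForeachWriter` in Scala or Java).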