Tags: scala, apache-spark, machine-learning, apache-spark-mllib, sliding-window

Understanding mllib sliding


I know that a sliding window in Spark Structured Streaming is a window over event time, defined by a window size (in seconds) and a step size (in seconds).
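For context, this is the kind of sliding window I mean. A minimal sketch, assuming a streaming DataFrame named events with a timestamp column (both names are mine, for illustration):

import org.apache.spark.sql.functions.{col, window}

// Every 5 seconds, aggregate over the last 10 seconds of event time:
// window size = 10 seconds, step (slide) size = 5 seconds.
val counts = events
  .groupBy(window(col("timestamp"), "10 seconds", "5 seconds"))
  .count()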

But then I came across this:

import org.apache.spark.mllib.rdd.RDDFunctions._

sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => (curSlice.sum / curSlice.size))
  .collect()

I don't understand this. There is no event time here, so what does sliding do?

If I comment out the .map line then I get results like:

[I@7b3315a5
[I@8ed9cf
[I@f72203
[I@377008df
[I@540dbda9
[I@22bb5646
[I@1be59f28
[I@2ce45a7b
[I@153d4abb
...

What does it mean to use mllib's sliding method like that on plain integers? And what are those gibberish values?


Solution

  • In the documentation for sliding we can see:

    Returns an RDD from grouping items of its parent RDD in fixed size blocks by passing a sliding window over them. The ordering is first based on the partition index and then the ordering of items within each partition. [...]

    So in the case of using sc.parallelize(1 to 100, 10) the order is the consecutive numbers from 1 to 100. Note that the windows also span partition boundaries: with 10 partitions, a window such as (9, 10, 11) contains elements from two adjacent partitions.

    The result of the sliding operation is an RDD whose elements are Arrays. Printing such an element calls its toString method; Array does not override toString, so the default from java.lang.Object is used, which produces TypeName@hexadecimalHash (here [I is the JVM type descriptor for int[], followed by the identity hash code). See How do I print my Java object without getting "SomeType@2f92e0f4"?.
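    A minimal sketch of the same behaviour in plain Scala (no Spark needed):

    val arr = Array(1, 2, 3)
    // Array inherits Object.toString, so this prints something like [I@7b3315a5:
    // "[I" is the JVM descriptor for int[], "@..." the identity hash code.
    println(arr)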

    You can use map(_.toSeq) to convert each array to a Seq, which overrides toString (and thus prints its elements as expected). Or you can use map(_.mkString(",")) to convert each array to a string directly, as shown below.
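    For instance, applied to the RDD from the question (a sketch using the same sc):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    sc.parallelize(1 to 100, 10)
      .sliding(3)
      .map(_.mkString(","))  // each window becomes a readable string like "1,2,3"
      .collect()
      .foreach(println)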

    The result of using sliding(3) would be (in this fixed order):

    1,2,3
    2,3,4
    3,4,5
    ...
    98,99,100
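
    Putting it together, the original snippet computes a 3-element moving average. A sketch that additionally avoids integer division by converting the sum to Double (the toDouble is my addition, not part of the original code):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val avgs = sc.parallelize(1 to 100, 10)
      .sliding(3)
      .map(w => w.sum.toDouble / w.size)  // average of each 3-element window
      .collect()
    // avgs: Array(2.0, 3.0, 4.0, ..., 99.0)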