I know that a sliding window in Spark Structured Streaming is a window over event time, defined by a window size and a step size (e.g., in seconds). But then I came across this:
import org.apache.spark.mllib.rdd.RDDFunctions._
sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => (curSlice.sum / curSlice.size))
  .collect()
I don't understand this. There is no event time here, so what does sliding do?
If I comment out the .map line, I get results like:
[I@7b3315a5
[I@8ed9cf
[I@f72203
[I@377008df
[I@540dbda9
[I@22bb5646
[I@1be59f28
[I@2ce45a7b
[I@153d4abb
...
What does it mean to use the sliding method of mllib like that on plain integers? And what are those gibberish values?
In the documentation for sliding we can see:
Returns an RDD from grouping items of its parent RDD in fixed size blocks by passing a sliding window over them. The ordering is first based on the partition index and then the ordering of items within each partition. [...]
So in the case of sc.parallelize(1 to 100, 10), the order is simply the consecutive numbers from 1 to 100: each of the 10 partitions holds a consecutive chunk, and the partitions are ordered by index.
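A quick sketch makes that ordering visible (assuming a spark-shell with the usual sc; mkString just renders each window readably, which is explained below):

import org.apache.spark.mllib.rdd.RDDFunctions._

// 10 numbers in 2 partitions: partition 0 holds 1..5, partition 1 holds 6..10.
// sliding(3) still yields the windows in global order, including the ones
// that cross the partition boundary (4,5,6 and 5,6,7).
sc.parallelize(1 to 10, 2)
  .sliding(3)
  .map(_.mkString(","))
  .collect()
  .foreach(println)

So sliding here has nothing to do with event time; it simply walks a fixed-size window over the RDD's elements in their partition order.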
Each element produced by the sliding operation is an Array. Printing one calls its toString method; however, Array does not override toString, so the implementation inherited from Object is used, which produces TypeName@hexadecimalHash (see How do I print my Java object without getting "SomeType@2f92e0f4"?).
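You can reproduce this in a plain Scala REPL, no Spark involved:

// println on an Array falls back to Object.toString: the JVM type name
// for an int array ("[I") plus "@" plus the hex hash code.
val a = Array(1, 2, 3)
println(a)  // e.g. [I@7b3315a5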
You can use map(_.toSeq) to convert each array to a Seq, which does override toString (and thus prints the elements as expected), or use map(_.mkString(",")) to convert each array to a string instead.
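Applied to the snippet from the question (again assuming the usual spark-shell sc), either option makes the windows readable:

import org.apache.spark.mllib.rdd.RDDFunctions._

sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(_.mkString(","))  // or .map(_.toSeq)
  .collect()
  .foreach(println)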
The result of using sliding(3) would be (in this exact order):
1,2,3
2,3,4
3,4,5
...
98,99,100
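With the printing sorted out, the .map in the original snippet is just a moving average over each window. One caveat: curSlice.sum / curSlice.size is integer division; for 1 to 100 the averages happen to be whole numbers, but in general you would want to convert to Double first. A sketch:

import org.apache.spark.mllib.rdd.RDDFunctions._

val movingAvg = sc.parallelize(1 to 100, 10)
  .sliding(3)
  .map(curSlice => curSlice.sum.toDouble / curSlice.size)  // avoid Int truncation
  .collect()
// Array(2.0, 3.0, 4.0, ..., 99.0)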