I have a question regarding azure stream analytics (which uses T-SQL like language) and in particular working with the offset parameter inside the argument of the TUMBLING WINDOW function. Basically I am trying to use the offset argument to make the start time of the window interval inclusive and the end window time exclusive (which is the opposite to the default behaviour).
Here is the reference documentation: https://learn.microsoft.com/en-us/stream-analytics-query/tumbling-window-azure-stream-analytics
The documentation mentions this can be done with offset and gives an example but don't really understand how it works and want to be able to apply it to the scenario where I make the TUMBLING WINDOW interval 1 day (not sure if that makes a difference or not to the parameters passed into offset). I haven't managed to find any clear explanations of this so would great if anyone has any insights.
SELECT System.Timestamp() AS WindowEnd, TollId, COUNT(*)
FROM Input TIMESTAMP BY EntryTime
GROUP BY TollId, TumblingWindow(Duration(day, 1), Offset(millisecond, -1))
Here I have just taken the example in the docs and altered the time window duration to my desired duration while leaving the offset as is. Not sure if this is right (doesn't seem right) but not entirely sure how to use the offset to make the window inclusive on the start window and exclusive on the end like it mentions in the documentation example.
First let's mention two good practices when writing a query with a temporal element:
TIMESTAMP BY EventEnqueuedUtcTime
(that you will need to add to your local data sample) which is the default implicit behavior on event hub anywaySELECT
--This is for a 1 minute window, adjust as necessary
DATEADD(minute,-1,System.Timestamp()) AS WindowStart,
System.Timestamp() AS WindowEnd,
...
Now that we have that, we can look at the offset syntax. The best in my opinion is to look at it via an example (that I should then put in the doc...):
My input file:
{"ReadingTimestamp":"2021-12-10T10:00:00.0000000","deviceId":1}
{"ReadingTimestamp":"2021-12-10T10:00:59.0000000","deviceId":2}
{"ReadingTimestamp":"2021-12-10T10:01:00.0000000","deviceId":3}
{"ReadingTimestamp":"2021-12-10T10:01:01.0000000","deviceId":4}
I will look count these records on a tumbling window of 1 minute, and see how they fall in and out of them depending on offset.
That's the default behavior:
SELECT
DATEADD(minute,-1,System.Timestamp()) AS WindowStart,
System.Timestamp() AS WindowEnd,
COUNT(*) AS C
FROM Input1 TIMESTAMP BY ReadingTimestamp
GROUP BY Tumbling(duration(minute,1))
Which gives me:
WindowStart | WindowEnd | C |
---|---|---|
2021-12-10T09:59:00.0000000Z | 2021-12-10T10:00:00.0000000Z | 1 |
2021-12-10T10:00:00.0000000Z | 2021-12-10T10:01:00.0000000Z | 2 |
2021-12-10T10:01:00.0000000Z | 2021-12-10T10:02:00.0000000Z | 1 |
Notice how the first record falls on 10:00:00 and gets pushed in the previous window.
Then using the offset:
SELECT
DATEADD(minute,-1,System.Timestamp()) AS WindowStart,
System.Timestamp() AS WindowEnd,
COUNT(*) AS C
INTO Output3
FROM Input1 TIMESTAMP BY ReadingTimestamp
GROUP BY Tumbling(duration(minute,1),offset(millisecond,-1))
We have:
WindowStart | WindowEnd | C |
---|---|---|
2021-12-10T09:59:59.9990000Z | 2021-12-10T10:00:59.9990000Z | 2 |
2021-12-10T10:00:59.9990000Z | 2021-12-10T10:01:59.9990000Z | 2 |
Where records 1 and 2 are now part of the first window, and 3 and 4 the second.
But also note how we haven't made the previous windows switch the behavior at their bounds (from inclusive to exclusive and vice versa), we have simply moved the goalpost by an epsilon value to emulate the behavior.
So:
But as long as you can find an offset that is small enough to be considered negligible by your use case you are good. We have microsecond so I'm guessing that should work out fine ;)