Search code examples
azureazure-stream-analytics

CollectTop is returning more rows than I would expect in Azure Stream Analytics


I have the following input (testing in the Azure portal) that I have uploaded:

[
  {"engineid":"engine001","eventtime":1,"tmp":19.3,"hum":0.22},
  {"engineid":"engine001","eventtime":2,"tmp":19.7,"hum":0.21},
  {"engineid":"engine002","eventtime":3,"tmp":20.4,"hum":0.25},
  {"engineid":"engine001","eventtime":4,"tmp":19.6,"hum":0.24}
]

Then I try to get the records group, so that I have the 2 last rows for every engine. As you can see in the sample, I only have 2 different engines, so I would expect an output with two records that each contain the ranked records, but I am getting 4 output records.

This is my query:

-- Taking relevant fields from the input stream
WITH RelevantTelemetry AS
(
    SELECT  engineid, tmp, hum, eventtime
    FROM    [engine-telemetry] 
    WHERE   engineid IS NOT NULL
),
-- Grouping by engineid in TimeWindows
TimeWindows AS
(
    SELECT engineid, 
        CollectTop(2) OVER (ORDER BY eventtime DESC) as TimeWindow
    FROM
        [RelevantTelemetry]
    WHERE engineid IS NOT NULL
    GROUP BY SlidingWindow(hour, 24), engineid
)
--Output timewindows for verification purposes
SELECT TimeWindow
INTO debug
FROM TimeWindows

I played with the TIMESTAMP BY property, changed the order of GROUP BY, etc, but still I keep having the following 4 records, rather than the 2 I was expecting:

Any idea?

[
{"TimeWindow":
  [
    {"rank":1,"value": "engineid":"engine001","tmp":0.0003,"hum":-0.0002,"eventtime":1}}
  ]},
{"TimeWindow":
  [
    {"rank":1,"value":{"engineid":"engine001","tmp":-0.0019,"hum":-0.0002,"eventtime":4}},
    {"rank":2,"value":{"engineid":"engine001","tmp":-0.0026,"hum":-0.0002,"eventtime":2}},
    {"rank":3,"value":{"engineid":"engine001","tmp":0.0003,"hum":-0.0002,"eventtime":1}}
  ]},
{"TimeWindow":
  [
    {"rank":1,"value":{"engineid":"engine002","tmp":0.0017,"hum":0.0003,"eventtime":3}}
  ]},
{"TimeWindow":
  [
    {"rank":1,"value":{"engineid":"engine001","tmp":-0.0019,"hum":-0.0002,"eventtime":4}},
    {"rank":2,"value":{"engineid":"engine001","tmp":-0.0026,"hum":-0.0002,"eventtime":2}}
  ]}
]

Solution

  • As suggested by @SteveZhao, you need to use GROUP BY TumblingWindow(hour, 24), engineid instead of GROUP BY SlidingWindow(hour, 24), engineid

    Sliding window can overlap entries based on time intervalenter image description here

    For more information refer: https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions