I am trying to eliminate the data anomalies in the data I am receiving from eventhub and send only selected data to azure functions through Azure stream analytics for that I am writing a sql query where I need some help
Requirement: I need to collect the past 60 seconds data and need to group by Id and compare the records that I received in the 60 seconds and If any record value is way higher than the selected values than ignore that record (for example, I will collect the 4 records in past 60 seconds and if the data is 40 40 40 40 5. We should drop the 5. Example 2 - 20 20 20 500 drop the 500. ).
My sql table will be something like this:
id Temp date datetime
123 30 2023-01-01 2023-01-01 12:00:00
124 35 2023-01-01 2023-01-01 12:00:00
123 31 2023-01-01 2023-01-01 12:00:00
123 33 2023-01-01 2023-01-01 12:00:00
123 60 2023-01-01 2023-01-01 12:00:00
124 36 2023-01-01 2023-01-01 12:00:00
124 36 2023-01-01 2023-01-01 12:00:00
124 8 2023-01-01 2023-01-01 12:00:00
124 36 2023-01-01 2023-01-01 12:00:00
I need to eliminate the records that are not in the range with the other records
I'll leave the details of the comparison up to you, but you can use a CROSS APPLY
to gather the data for comparison.
Something like:
SELECT *
FROM TemperatureData T
CROSS APPLY (
SELECT AVG(T2.Temp * 1.0) AS PriorAvgTemp, COUNT(*) As PriorCount
FROM TemperatureData T2
WHERE T2.id = T.id
AND T2.datetime >= DATEADD(second, -60, T.datetime)
AND T2.datetime < T.datetime
) P
WHERE T.Temp BETWEEN P.PriorAvgTemp - 10 AND P.PriorAvgTemp + 10
--OR P.PriorCount < 3 -- Should we allow if there is insufficient prior data
--AND P.PriorCount >= 3 -- Should we omit if there is insufficient prior data
Be sure you have an index on TemperatureData(id, datetime)
.
If you are willing to accept the last N values instead of a time range, windowed aggregate calculation may be more efficient.
SELECT *
FROM (
SELECT *,
AVG(T.Temp * 1.0)
OVER(PARTITION BY id ORDER BY datetime
ROWS BETWEEN 60 PRECEDING AND 1 PRECEDING)
AS PriorAvgTemp,
COUNT(*)
OVER(PARTITION BY id ORDER BY datetime
ROWS BETWEEN 60 PRECEDING AND 1 PRECEDING)
AS PriorCount
FROM TemperatureData T
) TT
WHERE TT.Temp BETWEEN TT.PriorAvgTemp - 10 AND TT.PriorAvgTemp + 10
--OR TT.PriorCount < 3 -- Should we allow if there is insufficient prior data
--AND TT.PriorCount >= 3 -- Should we omit if there is insufficient prior data
Please note: The above is untested code, which may need some syntax fixes and debugging. If you discover errors, please comment and I will correct the post.