Lately I've been playing around with Stream Analytics queries with PowerBI as the output sink. I wrote a simple query that counts the HTTP response codes of our website requests over time, grouped by date and response code. The input data is read from a storage account that holds blob storage. This is my query:
SELECT
DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0) as datum,
request.ArrayValue.responseCode,
count(request.ArrayValue.responseCode)
INTO
[requests-httpresponsecode]
FROM
[cvweu-internet-pr-sa-requests] R TIMESTAMP BY R.context.data.eventTime
OUTER APPLY GetArrayElements(R.request) as request
GROUP BY DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0), request.ArrayValue.responseCode, System.TimeStamp
Since continuous export became active on 3 September 2018, I chose a job start time of 3 September 2018. Because I am interested in the statistics up to today, I did not specify a date interval, so I expect to see data from 3 September 2018 until now (20 December 2018). The job runs without errors, with PowerBI as the output sink. I immediately saw the chart being populated from 3 September onward, grouped by day with counts. So far, so good. A few days later I noticed that the output dataset no longer started on 3 September but on 2 December. Apparently data is being overwritten.
The following link says:
https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-power-bi-dashboard
"defaultRetentionPolicy: BasicFIFO: Data is FIFO, with a maximum of 200,000 rows."
But my output table has nowhere near 200,000 rows:
datum,count,responsecode
2018-12-02 00:00:00,332348,527387
2018-12-03 00:00:00,3178250,3282791
2018-12-04 00:00:00,3170981,4236046
2018-12-05 00:00:00,2943513,3911390
2018-12-06 00:00:00,2966448,3914963
2018-12-07 00:00:00,2825741,3999027
2018-12-08 00:00:00,1621555,3353481
2018-12-09 00:00:00,2278784,3706966
2018-12-10 00:00:00,3160370,3911582
2018-12-11 00:00:00,3806272,3681742
2018-12-12 00:00:00,4402169,3751960
2018-12-13 00:00:00,2924212,3733805
2018-12-14 00:00:00,2815931,3618851
2018-12-15 00:00:00,1954330,3240276
2018-12-16 00:00:00,2327456,3375378
2018-12-17 00:00:00,3321780,3794147
2018-12-18 00:00:00,3229474,4335080
2018-12-19 00:00:00,3329212,4269236
2018-12-20 00:00:00,651642,1195501
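To make the quoted retention policy concrete, here is a minimal Python sketch of how I understand FIFO retention to behave. It is not the PowerBI implementation; `push_rows` and the tiny `MAX_ROWS` value are made up for illustration and stand in for the 200,000-row limit.

```python
from collections import deque
from datetime import date, timedelta

# Hypothetical stand-in for the 200,000-row limit, kept small for readability.
MAX_ROWS = 5

def push_rows(rows, max_rows=MAX_ROWS):
    """Keep only the newest max_rows rows, dropping the oldest first (FIFO)."""
    dataset = deque(maxlen=max_rows)
    for row in rows:
        dataset.append(row)
    return list(dataset)

# Eight daily rows starting on 3 September 2018: (day, response code, count).
start = date(2018, 9, 3)
rows = [(start + timedelta(days=i), 200, 1000 + i) for i in range(8)]

kept = push_rows(rows)
print(kept[0][0])  # first day still present after the oldest rows were dropped
```

If retention works like this, the earliest dates disappear as soon as more rows than the limit have been pushed, which would match a chart whose start date moves forward.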
EDIT: I created the STREAM input source according to https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-quick-create-portal. I can create a REFERENCE input as well, but that invalidates my query, since APPLY and GROUP BY are not supported. I also think a STREAM input is what I want, according to https://learn.microsoft.com/en-us/azure/stream-analytics/stream-analytics-add-inputs.
What am I missing? Is it my query?
It looks like my query was the problem. I had to group by TUMBLINGWINDOW(day,1) instead of System.TimeStamp.
TUMBLINGWINDOW and System.TimeStamp produce exactly the same chart output on the frontend, but they seem to be processed differently in the backend. Nothing on the frontend reflects that difference, which made this confusing. I suspect that, because of the way the query is processed without TUMBLINGWINDOW, you hit the 200,000-row-per-dataset limit sooner than expected. The query below is the one that produces the expected result.
SELECT
request.ArrayValue.responseCode,
count(request.ArrayValue.responseCode),
DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0) as date
INTO
[requests-httpstatuscode]
FROM
[cvweu-internet-pr-sa-requests] R TIMESTAMP BY R.context.data.eventTime
OUTER APPLY GetArrayElements(R.request) as request
GROUP BY DATETIMEFROMPARTS(DATEPART(year,R.context.data.eventTime), DATEPART(month,R.context.data.eventTime),DATEPART(day,R.context.data.eventTime),0,0,0,0),
TUMBLINGWINDOW(day,1),
request.ArrayValue.responseCode
Right now my Stream Analytics job is running smoothly and producing the expected output from 3 September until now, without data being overwritten.
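My working assumption, which I have not verified against the service internals, is that grouping by System.TimeStamp produces a result row for every distinct event timestamp, while TUMBLINGWINDOW(day,1) produces one row per day and response code. The Python sketch below only simulates that difference on synthetic events; `make_events` and `count_rows` are hypothetical helpers, not part of Stream Analytics.

```python
from datetime import datetime, timedelta

def make_events(days, per_day):
    """Synthetic (timestamp, response code) events, one per second each day."""
    start = datetime(2018, 9, 3)
    events = []
    for d in range(days):
        for i in range(per_day):
            ts = start + timedelta(days=d, seconds=i)
            events.append((ts, 200 if i % 2 == 0 else 404))
    return events

def count_rows(events, key):
    """Group events by key(ts, code) and count the events in each group."""
    groups = {}
    for ts, code in events:
        k = key(ts, code)
        groups[k] = groups.get(k, 0) + 1
    return groups

events = make_events(days=3, per_day=100)

# Assumed behaviour of GROUP BY day, responseCode, System.TimeStamp:
# the event timestamp is part of the key, so nearly every event is its own row.
per_timestamp = count_rows(events, lambda ts, code: (ts.date(), code, ts))

# Assumed behaviour of GROUP BY day, responseCode, TUMBLINGWINDOW(day,1):
# one row per day and response code.
per_day_window = count_rows(events, lambda ts, code: (ts.date(), code))

print(len(per_timestamp), len(per_day_window))
```

Both groupings add up to the same totals per day, which would explain why the chart looks identical, but the first one pushes far more rows into the dataset and so reaches a FIFO row limit much earlier.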