I am currently querying data from a Parquet data lake stored in Azure Data Lake Gen2
via Azure Synapse
(Serverless SQL pools) and through Grafana
with Microsoft SQL Server
as my Grafana data source.
My query looks as below:
SELECT
DATEADD(SECOND, DATEDIFF(SECOND, '2020', t) / (200/1000.0) * (200/1000.0), '2020') AS time,
AVG(accelerationx) as AVG_accelerationx
FROM
OPENROWSET(
BULK 'https://cssdatalakestoragegen2.dfs.core.windows.net/cssdatalakestoragegen2filesystem/3BA199E2/CAN2_gnssimu/*/*/*/*',
FORMAT = 'PARQUET'
) AS r
WHERE
t BETWEEN '2020-10-28T14:35:31Z' AND '2020-10-28T14:38:10Z'
GROUP BY
DATEDIFF(SECOND, '2020', t) / (200/1000.0)
ORDER BY time
OFFSET 0 ROWS;
This results in data grouped to a 1 second resolution, even though the expected result was data grouped to a 200 ms resolution. The original file is stored at a 100 ms resolution.
The Parquet file I am trying to query is stored as follows in Azure:
https://cssdatalakestoragegen2.dfs.core.windows.net/cssdatalakestoragegen2filesystem/3BA199E2/CAN2_gnssimu/2020/10/28/00000014_00000001.parquet
I have attached the Parquet file in question: https://canlogger1000.csselectronics.com/files/temp/00000014_00000001.parquet
I assume my query is causing the results to be aggregated to SECOND
basis instead of MILLISECOND
basis - but replacing SECOND
in the query (and directly parsing the 200 ms instead of 0.2 seconds) causes an overflow error as below:
convert frame from rows error: mssql: The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.
The rounding already happened in the datediff()
. The division afterward won't create new buckets.
So just add another level of grouping:
GROUP BY
DATEDIFF(SECOND, '2020', t),
DATEPART(MILLISECOND, t) / 200
To get back the time value add the two components:
DATEADD(MILLISECOND, DATEPART(MILLISECOND, t) / 200, DATEADD(SECOND, DATEDIFF(SECOND, '2020', t)))
One advantage of doing it in separate levels is that the output range of the expressions might avoid or forestall needing to deal with large integers and the potential for overflows. It's also possibly easier to comprehend. On the other hand it could still be worth comparing the query plans to see if the execution is improved with a single value for bucketing and sorting.
Another simplifying assumption here is that your buckets always align with the parent. So a caveat is that you wouldn't be able to just change the divisor to 400, for example, to get increments of 400 milliseconds. From that standpoint this is a disadvantage.
Correcting is still fairly easy and merely requires dividing the seconds portion since the pattern restarts every two seconds now (the least common multiple of 400 and 1000.) A similar calculation deals with the remainder of milliseconds.
GROUP BY
DATEDIFF(SECOND, '2020', t) / 2,
(DATEPART(SECOND, t) % 2 * 1000 + DATEPART(MILLISECOND, t)) / 400
The alternate expression offered in comments would become a bit more complicated in such a scenario in the same way. (This actually uses modulus to avoid essentially repeating much of the calculation):
DATEADD(MILLISECOND,
(DATEPART(SECOND, t) % 2 * 1000
+ DATEPART(MILLISECOND, t)) % -400,
t)