I am trying to use the built-in parquet demo file inside questdb (trades.parquet) to see how it works, but I am stuck as I cannot use SAMPLE BY
.
I can do this
select * from read_parquet('trades.parquet');
And it works as expected. I can see the output below with the columns symbol, side, price, amount, and timestamp
. The dataset is already sorted by increasing timestamp on the file.
I now want to do the simplest SAMPLE BY
query
select timestamp, avg(price) from read_parquet('trades.parquet') SAMPLE by 1m;
But I am getting an error base query does not provide ASC order over dedicated TIMESTAMP column
.
I tried creating a table and then doing INSERT INTO tb SELECT * FROM read_parquet('trades.parquet')
and I can then use SAMPLE BY
. But in this case I am duplicating data and I was hoping to do time-series queries in-place over my file.
The problem with the SAMPLE BY
is that it works over the designated timestamp. When you create a table without a designated timestamp, you get the same error.
Since the parquet file is an external file, QuestDB doesn't know which column, if any, is the designated timestamp. It cannot just use the timestamp column, as for SAMPLE BY
to work, the data must be sorted by increasing timestamp. When tables are created inside QuestDB, the increasing timestamp order is always implicit, as QuestDB forces that when persisting data on disk, but when using an external file, it cannot assume anything.
In QuestDB we can always hint the designated timestamp by doing this:
select timestamp, avg(price) from (read_parquet('trades.parquet') timestamp(timestamp)) sample by 1m;
This should work fine in this case, as we know the file is already sorted by time. However, if we were not sure about the original order, we could always add an ORDER BY
, as in
select timestamp, avg(price) from ((read_parquet('trades.parquet') order by timestamp ) timestamp(timestamp)) SAMPLE BY 1m
Which is shorthand for
select timestamp, avg(price) from (SELECT * FROM (read_parquet('trades.parquet') order by timestamp ) timestamp(timestamp)) SAMPLE BY 1m
Or if we prefer a CTE for clarity
WITH timestamped AS (
(
SELECT * FROM read_parquet('trades.parquet')
order by timestamp
)
timestamp(timestamp )
)
SELECT timestamp, avg(price) from timestamped SAMPLE BY 1m;