Search code examples
databasetime-seriesquestdb

Cannot SAMPLE BY when reading from an external parquet file


I am trying to use the built-in parquet demo file inside questdb (trades.parquet) to see how it works, but I am stuck as I cannot use SAMPLE BY.

I can do this

select * from read_parquet('trades.parquet');

And it works as expected. I can see the output below with the columns symbol, side, price, amount, and timestamp. The dataset is already sorted by increasing timestamp on the file.

screenshot of web console with symbol, side, price, amount, and timestamp, sorted by increasing timestamp

I now want to do the simplest SAMPLE BY query

select timestamp, avg(price) from read_parquet('trades.parquet') SAMPLE by 1m;

But I am getting an error base query does not provide ASC order over dedicated TIMESTAMP column.

I tried creating a table and then doing INSERT INTO tb SELECT * FROM read_parquet('trades.parquet') and I can then use SAMPLE BY. But in this case I am duplicating data and I was hoping to do time-series queries in-place over my file.


Solution

  • The problem with the SAMPLE BY is that it works over the designated timestamp. When you create a table without a designated timestamp, you get the same error.

    Since the parquet file is an external file, QuestDB doesn't know which column, if any, is the designated timestamp. It cannot just use the timestamp column, as for SAMPLE BY to work, the data must be sorted by increasing timestamp. When tables are created inside QuestDB, the increasing timestamp order is always implicit, as QuestDB forces that when persisting data on disk, but when using an external file, it cannot assume anything.

    In QuestDB we can always hint the designated timestamp by doing this:

    select timestamp, avg(price) from (read_parquet('trades.parquet') timestamp(timestamp)) sample by 1m;
    

    This should work fine in this case, as we know the file is already sorted by time. However, if we were not sure about the original order, we could always add an ORDER BY, as in

    select timestamp, avg(price) from ((read_parquet('trades.parquet') order by timestamp  ) timestamp(timestamp)) SAMPLE BY 1m
    

    Which is shorthand for

    select timestamp, avg(price) from (SELECT * FROM (read_parquet('trades.parquet') order by timestamp  ) timestamp(timestamp)) SAMPLE BY 1m
    

    Or if we prefer a CTE for clarity

    WITH timestamped AS (
      (
       SELECT * FROM read_parquet('trades.parquet') 
       order by timestamp  
      )    
      timestamp(timestamp ) 
      )
     SELECT  timestamp, avg(price) from timestamped SAMPLE BY 1m;