Search code examples
databasetime-seriesquestdb

Calculate Median on QuestDB


I have this query working on QuestDB, in which I get max, min, and avg price:

select approx_percentile(price, 0.5), max(price), min(price), avg(price)
from trades
where symbol = 'BTC-USDT' and timestamp in yesterday();

This works fine, but now I want to calculate the median. I checked the docs and I couldn't find any function for that, but since the median is equivalent to the 50th percentile and I saw QuestDB supports approx_percentile I tried this:

select approx_percentile(price, 0.5,1),  approx_percentile(price, 0.5,5), max(price), min(price), avg(price)
from trades
where symbol = 'BTC-USDT' and timestamp in yesterday();

And it worked fine (different precision values gave different approximations, but that's to be expected). My problem comes if I have any negative values in the table, as I get this error:

out of bounds for histogram, current covered range [6.668014432879854E240, 1.3656093558537942E244) cannot be extended any further. Caused by: [-1] Negative values cannot be recorded

A workaround could be shifting the values by adding the values by adding the mininum price on my dataset to every other price, then calculating the percentile, then shifting back by substracting the min_price. I tried that (query below for reference) and it works, but this is hacky and makes for a non-intuitive query.

I wonder if anyone has a better suggestion.

WITH
min_price AS (
select ABS(min(price)) as min_price from trades
where symbol = 'BTC-USDT' and timestamp in yesterday()
),
shifted_data AS (
    SELECT price + min_price AS shifted_price
    FROM trades cross join min_price
where symbol = 'BTC-USDT' and timestamp in yesterday()
) 
    SELECT approx_percentile(shifted_price, 0.5, 5) - min_price AS shifted_median
    FROM shifted_data cross join min_price;

Solution

  • At the moment QuestDB does not have a mean function and the approx_percentile implementation is based on the HdrHistogram library which is very efficient as a trade-off of not being able to deal with negative values. While a core QuestDB engineer confirmed on a slack channel that this will be addressed in the future, they couldn't provide any deadline.

    As another workaround, we can make a slightly simpler query. The mean value on a dataset is the one that sits right in the middle, with as many rows before as after it. With this information, if we could sort the dataset by the target column, count the total number of rows, then get the value for the row at the exact middle position, we have the median.

    with prices_and_stats AS (
    select price, row_number OVER(order by price) as row_num, count(price) OVER() as total_rows
    from trades
    where symbol = 'BTC-USDT' and timestamp in yesterday()
    )
    SELECT price from prices_and_stats where row_num::double IN (
        (total_rows + 1) / 2,  -- For odd rows
        (total_rows + 2) / 2   -- For even rows
    );
    

    The query uses the row_number and count window functions so we can get both values in a single query (as opposed to the first workaround, that had to first find the minimum value and then cross join), and then in the main query we only need to find which is the middle row. We divide the total row numbers + 1 (or plus 2 for even rows) and divide by two, then we get the row at that position.

    It is not ideal, but until QuestDB supports either median or percentiles with negative values, this would do the trick and performs quite well.