Search code examples
sqlpostgresqlquery-optimizationdatabase-performancepostgresql-performance

Index skip scan emulation to retrieve distinct product IDs and min/max for additional columns


Here's my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ NOT NULL,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

CREATE INDEX idx_tickers_product_id_time ON tickers (product_id, time);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.

The table has over two million rows now.

I learned how to use index skip scan emulation (see: SELECT DISTINCT is slower than expected on my table in PostgreSQL) in PostgreSQL in order to quickly retrieve distinct product_id values from this table, rather than using the slower SELECT DISTINCT method.

I also want to retrieve min/max values for other columns. Here's what I came up with. It takes ~2.9 milliseconds over 2.25 rows.

Is there a better way to accomplish this?

WITH product_ids AS (
    WITH RECURSIVE cte AS (
       (   -- parentheses required
           SELECT product_id
           FROM tickers
           ORDER BY 1
           LIMIT 1
       )
       UNION ALL
       SELECT l.*
       FROM cte c
       CROSS JOIN LATERAL (
          SELECT product_id
          FROM tickers t
          WHERE t.product_id > c.product_id  -- lateral reference
          ORDER BY 1
          LIMIT 1
          ) l
       )
    TABLE cte
)
SELECT
    product_id,
    (SELECT (MAX(trade_id) - MIN(trade_id) + 1) FROM tickers WHERE product_id = product_ids.product_id) AS ticker_count,
    (SELECT MIN(time) FROM tickers WHERE product_id = product_ids.product_id) AS min_time,
    (SELECT MAX(time) FROM tickers WHERE product_id = product_ids.product_id) AS max_time
FROM product_ids
ORDER BY ticker_count DESC

Solution

  • Query

    Using the existing index on (product_id, time) we can get two for the price of one, i.e. fetch product_id and minimum time in one index scan:

    WITH RECURSIVE product_ids AS (
       (   -- parentheses required
       SELECT product_id, time AS min_time
       FROM   tickers
       ORDER  BY 1, 2
       LIMIT  1
       )
       UNION ALL
       SELECT l.*
       FROM   product_ids p
       CROSS JOIN LATERAL (
          SELECT t.product_id, t.time
          FROM   tickers t
          WHERE  t.product_id > p.product_id
          ORDER  BY 1, 2
          LIMIT  1
          ) l
       )
    SELECT product_id, min_time
        , (SELECT MAX(time) FROM tickers WHERE product_id = p.product_id) AS max_time
        , (SELECT MAX(trade_id) - MIN(trade_id) + 1 FROM tickers WHERE product_id = p.product_id) AS ticker_count
    FROM   product_ids p
    ORDER  BY ticker_count DESC;
    

    Also, no need for a second CTE wrapper.

    Indexes

    Currently you have two indexes: The PK index on (product_id, trade_id), and another one on (product_id, time). You might optimize this by reversing the column order in one of both. Like:

    PRIMARY KEY (trade_id, product_id)
    

    Logically equivalent, but typically more efficient as it covers a wider range of possible queries. See (again):

    We only need the existing index on (product_id, time), so no direct effect on this query.