Search code examples
sqlpostgresqlquery-optimizationdatabase-performancepostgresql-performance

SELECT DISTINCT is slower than expected on my table in PostgreSQL


Here's my table schema:

CREATE TABLE tickers (
    product_id TEXT NOT NULL,
    trade_id INT NOT NULL,
    sequence BIGINT NOT NULL,
    time TIMESTAMPTZ,
    price NUMERIC NOT NULL,
    side TEXT NOT NULL,
    last_size NUMERIC NOT NULL,
    best_bid NUMERIC NOT NULL,
    best_ask NUMERIC NOT NULL,
    PRIMARY KEY (product_id, trade_id)
);

My application subscribes to Coinbase Pro's websocket on the "ticker" channel and inserts a row into the tickers table whenever it receives a message.

The table has nearly two million rows now.

I assumed that running SELECT DISTINCT product_id FROM tickers would be fast, but it takes around 500 to 600 milliseconds. Here's the output from EXPLAIN ANALYZE:

HashAggregate  (cost=47938.97..47939.38 rows=40 width=8) (actual time=583.105..583.110 rows=40 loops=1)
  Group Key: product_id
  ->  Seq Scan on tickers  (cost=0.00..42990.98 rows=1979198 width=8) (actual time=0.030..195.536 rows=1979243 loops=1)
Planning Time: 0.068 ms
Execution Time: 583.137 ms

If I turn off seq scanning by running SET enable_seqscan = FALSE (not something I want to actually rely on, just doing it for testing purposes) then the query is a little faster. Between 400 and 500 milliseconds. Here's the output from EXPLAIN ANALYZE:

Unique  (cost=0.43..80722.61 rows=40 width=8) (actual time=0.020..480.339 rows=40 loops=1)
  ->  Index Only Scan using tickers_pkey on tickers  (cost=0.43..75772.49 rows=1980051 width=8) (actual time=0.019..344.113 rows=1980160 loops=1)
        Heap Fetches: 328693
Planning Time: 0.064 ms
Execution Time: 480.386 ms

There are only 40 unique product IDs in the table. I assumed that since product_id is part of the composite primary key, and thus indexed, SELECT DISTINCT product_id FROM tickers would be much faster. But as it turns out, the query planner defaults to using a seq scan rather than the index, and even if I force it to use the index it's still slow (but a little faster than seq scan). I realize I could create another table to store nothing but unique product IDs and query that instead, but I'm more concerned with the reason(s) why my query on the tickers table is taking so long.

EDIT #1: I tried creating an index solely on the product_id column (CREATE INDEX idx_tickers_product_id ON tickers (product_id)) and the query planner still does a sequential scan unless I run SET enable_seqscan = FALSE first. But its performance is slightly better (10 to 50 milliseconds faster) than when the composite PK index is used.

EDIT #2: I tried Erwin Brandstetter's solution and it greatly improved the speed. There are now 2.25 million rows in the table and the execution only takes 0.75 milliseconds!

EDIT #3: I wanted to augment the accepted solution in order to retrieve the ticker count (max(trade_id) - min(trade_id) + 1) as well as the min and max time for each product id. I created a new question for this: How to use index skip emulation in PostgreSQL to retrieve distinct product IDs and also min/max for certain columns


Solution

  • While there is no index skip scan in Postgres yet, emulate it:

    WITH RECURSIVE cte AS (
       (   -- parentheses required
       SELECT product_id
       FROM   tickers
       ORDER  BY 1
       LIMIT  1
       )
       UNION ALL
       SELECT l.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT product_id
          FROM   tickers t
          WHERE  t.product_id > c.product_id  -- lateral reference
          ORDER  BY 1
          LIMIT  1
          ) l
       )
    TABLE  cte;
    

    With an index on (product_id) and only 40 unique product IDs in the table this should be Fast. With capital F.
    The PK index on (product_id, trade_id) is good for it, too!

    With only very few rows per product_id (the opposite of your data distribution), DISTINCT / DISTINCT ON would be as fast or faster.

    Work to implement index skip scans is ongoing.
    See: