Search code examples
postgresqlgreatest-n-per-group

PostgreSQL fast query to get most recent rows


I have a PostgreSQL table with a bunch of events, each containing a timestamp, id, and action. I expect a few million events per day, and expect the table to grow to several billion rows eventually.

I want to query the most recent events for each item (a few hundred thousand distinct ids) before a given date, but currently the query is extremely slow, taking 1-2 hours minimum (the table currently has around 100 million rows). Is there a way to speed up this query?

SELECT a.* FROM (
    with events as (
        SELECT
          ROW_NUMBER() OVER (PARTITION BY item ORDER BY time_stamp DESC) AS rn,
          *
        FROM event_updates
        WHERE time_stamp < '2023-05-01'
    )
    SELECT * FROM events WHERE rn=1 ORDER BY item
) a;

DDL for table

CREATE TABLE "event_updates" (
  "id" int4 NOT NULL DEFAULT nextval("event_updates"::regclass),
  "time_stamp" timestamptz(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "item" varchar(32) COLLATE "pg_catalog"."default" NOT NULL,
  "event_type" int2,
)
PARTITION BY ();
ALTER TABLE "event_updates" OWNER TO "owners";

Solution

  • First of all, you need an index on item and the timestamp. Second, you select the max(timestamp) for each item and then select the record you need:

    CREATE INDEX idx_event_updates_item_time_stamp
        ON event_updates(item, time_stamp);
    
    SELECT *
    FROM event_updates
        JOIN (SELECT item
                   , MAX(time_stamp) time_stamp
              FROM event_updates
              GROUP BY item
              ) sub USING (item, time_stamp);
    

    You might want to create the index per partition, concurrently, to avoid locking issues. It takes longer however.