Search code examples
postgresqlindexingtimestamppostgresql-performance

Faster search by indexing timestamp column on postgresql?


I have a PostgreSQL database and a table that consists of events. Those events have column end_time, which has a type timestamp (without timezone info). In my app I make frequent queries to the table, that try to select all events that happen in the future. So basically I'm doing this kind of SQL query:

SELECT * FROM events WHERE end_time >= ?::timestamp

I do not have an index on the end_time column currently. I fear that once my table row size grows larger (which it already has done quite much actually), the search query for future events will become slower? Because now the database search has to go through all rows to choose the ones that happen (or for more precisely, ends) in the future. I've used indexes before but can't say I'm the most familiar with them. I wonder if indexing the end_time column by creating the default Postgres index to it would increase the performance of the query? I don't have a real problem yet, but I don't want to wait for it to appear once the amount of data increases. Because then it's kind of too late, at least the user experience of the end application has decreased then.

I want to point out that I do use the timestamp without timezone, as my app always assumes local time and I should not need the timezone info. But I heard that it might have an effect with the indexing? Also my timestamps are not constrained in any way currently. So they could in theory be from now to the infinite future. I wonder if setting some constraints could make the indexing better? Something like the event time should be within 15 year or something?

One another option is that I would move events to another table that are in the past (archived_events). So that the table size of the events would not get too big. I could for example have a cron job that would do it regularly.

Also I heard that running analyze/explain to the database can actually improve it's performance? If this is the case, how often should I run those?

PostgreSQL version: 12.3


Solution

  • I wonder if indexing the end_time column [...] would increase the performance of the query?

    If Postgres expects that only a few percent or less qualify (have end_time in the future), it will use index on the column in an "index scan" or a "bitmap index scan".

    If that estimation not too far off, it will also actually increase performance. That's why you should have autovacuum enabled as is the default: to keep column statistics up to date.

    If you don't actually need all columns from your query (SELECT *) - which you typically don't - then only list the columns you actually need to make it faster yet. Maybe even allow an "index only scan". See:

    I wonder if setting some constraints could make the indexing better? Something like the event time should be within 15 year or something?

    No. No effect on your query whatsoever. The number of rows in the future is the deciding factor.

    I would move events to another table that are in the past (archived_events) ...?

    Btree indexes scale excellently. Meaning, as long as only few rows qualify, the number of eliminated rows hardly matters. If your table is huge (millions or billions of rows) and most of them in the past, a partial index might be better, mostly due to the reduction of index size and index maintenance cost.

    The special difficulty: "now" is a dynamic value. An index definition requires immutable values. The workaround is to pick an arbitrary "now" to cut off the bulk of rows. Something like:

    CREATE INDEX ON events(end_time) WHERE end_time > '2021-01-30';
    

    Modern Postgres is smart enough to understand it can use the index for future dates. Older versions might need a redundant WHERE clause to make it understand that the partial index is applicable:

    SELECT * FROM events
    WHERE  end_time >= ?::timestamp
    AND    end_time > '2021-01-30';  -- match index
    

    The usefulness of the index deteriorates over time, also depending on row churn. You might recreate the index from time to time to cut off more rows.


    Aside, don't let the type name timestamp with time zone misguide you. It does not store time zone information. And it is typically the best choice. See: