Tags: sql, postgresql, indexing, sql-execution-plan, postgresql-performance

Efficient PostgreSQL query on timestamp using index or bitmap index scan?


In PostgreSQL, I have an index on a timestamp field (updated_at) on my tickets table. When I compare the field against now(), the query is pretty efficient:

# explain analyze select count(1) as count from tickets where updated_at > now();
                                                             QUERY PLAN                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=90.64..90.66 rows=1 width=0) (actual time=33.238..33.238 rows=1 loops=1)
   ->  Index Scan using tickets_updated_at_idx on tickets  (cost=0.01..90.27 rows=74 width=0) (actual time=0.016..29.318 rows=40250 loops=1)
         Index Cond: (updated_at > now())
Total runtime: 33.271 ms

It goes downhill and uses a Bitmap Heap Scan if I try to compare it against now() minus an interval.

# explain analyze select count(1) as count from tickets where updated_at > (now() - '24 hours'::interval);
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate  (cost=180450.15..180450.17 rows=1 width=0) (actual time=543.898..543.898 rows=1 loops=1)
   ->  Bitmap Heap Scan on tickets  (cost=21296.43..175963.31 rows=897368 width=0) (actual time=251.700..457.916 rows=924373 loops=1)
         Recheck Cond: (updated_at > (now() - '24:00:00'::interval))
         ->  Bitmap Index Scan on tickets_updated_at_idx  (cost=0.00..20847.74 rows=897368 width=0) (actual time=238.799..238.799 rows=924699 loops=1)
               Index Cond: (updated_at > (now() - '24:00:00'::interval))
Total runtime: 543.952 ms

Is there a more efficient way to query using date arithmetic?


Solution

  • The 1st query expects to find rows=74, but actually finds rows=40250.
    The 2nd query expects to find rows=897368 and actually finds rows=924699.

    Of course, processing roughly 23 times as many rows takes considerably more time. So your actual times are not surprising.
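
    If you want to see the inputs behind those row estimates, you can look at the planner's column statistics. A minimal sketch, assuming the table and column names from your queries (pg_stats is a standard system view):

    -- histogram_bounds is what the planner uses to estimate how many
    -- rows fall above a given updated_at cutoff
    SELECT n_distinct, null_frac, histogram_bounds
    FROM   pg_stats
    WHERE  tablename = 'tickets'
    AND    attname   = 'updated_at';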

    Statistics for data with updated_at > now() are outdated. Run:

    ANALYZE tickets;
    

    and repeat your queries. And you seriously have data with updated_at > now()? That sounds wrong.
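
    A quick sanity check for that (just a sketch; adjust names to your schema):

    -- count rows whose updated_at lies in the future
    SELECT count(*)        AS future_rows,
           max(updated_at) AS latest_updated_at
    FROM   tickets
    WHERE  updated_at > now();

    If future_rows comes back large, something is writing timestamps in the future, which may be worth investigating on its own.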

    It's not surprising, however, that statistics are outdated for the most recently changed data. That's in the nature of things. If your query depends on current statistics, you have to run ANALYZE before you run your query.
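
    To see when statistics were last gathered for the table, you can query the standard statistics view pg_stat_user_tables (purely a diagnostic, not part of the fix):

    SELECT relname, last_analyze, last_autoanalyze
    FROM   pg_stat_user_tables
    WHERE  relname = 'tickets';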

    Also test with (in your session only):

    SET enable_bitmapscan = off;
    

    and repeat your second query to see times without bitmap index scan.
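
    For instance, the whole experiment might look like this in one session; RESET restores the default afterwards:

    SET enable_bitmapscan = off;

    EXPLAIN ANALYZE
    SELECT count(1) AS count
    FROM   tickets
    WHERE  updated_at > (now() - '24 hours'::interval);

    RESET enable_bitmapscan;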

    Why bitmap index scan for more rows?

    A plain index scan fetches rows from the heap one by one, in the order they are found in the index. That's simple, dumb and without overhead. It is fast for a few rows, but it can end up more expensive than a bitmap index scan as the number of rows grows.

    A bitmap index scan first collects the matching rows from the index and then looks up the table. If multiple rows reside on the same data page, that saves repeated visits to that page and can make things considerably faster. The more rows, the greater the chance that a bitmap index scan will save time.

    For even more rows (around 5% of the table, heavily depends on actual data), the planner switches to a sequential scan of the table and doesn't use the index at all.
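
    You can watch that switch happen by widening the time window, e.g. with a hypothetical 90-day interval (purely illustrative; where the tipping point lies depends entirely on your data and cost settings):

    EXPLAIN
    SELECT count(1) AS count
    FROM   tickets
    WHERE  updated_at > (now() - '90 days'::interval);

    At some width the plan should change to a plain "Seq Scan on tickets".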

    The optimum would be an index-only scan, introduced with Postgres 9.2. That's only possible if some preconditions are met: all relevant columns must be included in the index, the index type must support it, and the visibility map must indicate that all rows on a data page are visible to all transactions. Then that page doesn't have to be fetched from the heap (the table); the information in the index is enough.
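
    Your count query only references updated_at, which tickets_updated_at_idx already covers, so on 9.2 or later it is a candidate for an index-only scan. A minimal sketch of how you might check, assuming you can afford a VACUUM to freshen the visibility map:

    VACUUM tickets;   -- updates the visibility map

    EXPLAIN ANALYZE
    SELECT count(1) AS count
    FROM   tickets
    WHERE  updated_at > (now() - '24 hours'::interval);

    Look for "Index Only Scan using tickets_updated_at_idx" in the plan; whether the planner actually picks it still depends on costs and on how much of the table is marked all-visible.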

    The decision depends on your statistics (how many rows Postgres expects to find and their distribution) and on cost settings, most importantly random_page_cost, cpu_index_tuple_cost and effective_cache_size.
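
    To see where those settings currently stand, and to experiment with them per session (the value below is only an example, not a recommendation; sensible numbers depend on your hardware):

    SHOW random_page_cost;
    SHOW cpu_index_tuple_cost;
    SHOW effective_cache_size;

    -- e.g. on fast storage, try a lower random_page_cost for this session only
    SET random_page_cost = 1.1;
    EXPLAIN
    SELECT count(1) AS count
    FROM   tickets
    WHERE  updated_at > (now() - '24 hours'::interval);
    RESET random_page_cost;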