Tags: python, postgresql, query-optimization, amazon-rds, database-indexes

Using an index is better than a sequential scan when every hundredth row is needed, but only with an explicit list of values


I have a table (on an RDS Postgres 15.4 instance, db.m7g.large):

CREATE TABLE MyTable (
    content_id integer, 
    part integer, 
    vector "char"[]
);

There is a B-tree index on content_id. My data consists of 100M rows. There are 1M (0 .. 10^6-1) distinct values of content_id, and for each value of content_id there are 100 (0..99) values of part. The column vector contains an array of 384 byte-sized numbers when content_id is divisible by 100; otherwise it is NULL.
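
For reference, a data set with this shape can be produced with something like the following sketch (the fill value 1 is arbitrary/illustrative; the index name is taken from the plans further down):

-- Sketch: 1M content_ids x 100 parts = 100M rows; every hundredth
-- content_id gets a 384-element "char"[] vector, the rest stay NULL.
INSERT INTO MyTable (content_id, part, vector)
SELECT c, p,
       CASE WHEN c % 100 = 0
            THEN array_fill(1::"char", ARRAY[384])
            ELSE NULL
       END
FROM generate_series(0, 999999) AS c,
     generate_series(0, 99) AS p;

CREATE INDEX idx_content_on_mytable ON MyTable (content_id);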

I have constructed this artificial data to test the performance of the following query, submitted from a Python script (it will become clear in a moment why I left it in Python for the question):

query = f"""
WITH 
    Vars(key) as (
        VALUES (array_fill(1, ARRAY[{384}])::vector)
    ),
    Projection as (
        SELECT *
        FROM MyTable P
        WHERE P.content_id in ({str(list(range(0, 999999, 100)))[1:-1]})
    )
SELECT P.content_id, P.part
FROM Projection P, Vars
ORDER BY P.vector::int[]::vector <#> key
LIMIT {10};
"""

<#> is the (negative) inner product operator of the pgvector extension, and vector is the type defined by that extension, which to my understanding is similar to real[].

Note that the WHERE clause specifies an explicit list of 10K values of content_id (which correspond to 1M rows, i.e. every hundredth row in the table). Because of this large explicit list, I have to leave my query in Python and cannot run EXPLAIN ANALYZE.
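
(The plans shown further down can still be captured from Python by prefixing the generated statement with EXPLAIN; a minimal sketch, assuming psycopg2 as the driver, which is not part of the original script:)

import psycopg2  # assumed driver; any DB-API-compatible driver works the same way

conn = psycopg2.connect("")  # connection parameters are environment-specific
with conn.cursor() as cur:
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + query)  # `query` is the f-string built above
    print("\n".join(row[0] for row in cur.fetchall()))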

The above query takes ~6 seconds to execute.

However, when I prepend this query with SET enable_seqscan = off; the query takes only ~3 seconds.

Question 1: Given that we need every 100th row and that much of the computation is spent computing the dot products and ordering by them, how come a sequential scan is not better than using the index? (All the more so, I can't understand how using the index could result in an improvement by a factor of 2.)

Question 2: How come this improvement disappears if I change the explicit list of values for generate_series as shown below?

WHERE content_id IN (SELECT generate_series(0, 999999, 100))

Now, for this latter query I have the output for EXPLAIN ANALYZE:

 Limit  (cost=1591694.63..1591695.46 rows=10 width=24) (actual time=6169.118..6169.125 rows=10 loops=1)
   ->  Result  (cost=1591694.63..2731827.31 rows=13819790 width=24) (actual time=6169.117..6169.122 rows=10 loops=1)
         ->  Sort  (cost=1591694.63..1626244.11 rows=13819790 width=424) (actual time=6169.114..6169.117 rows=10 loops=1)
               Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 34kB
               ->  Nested Loop  (cost=194.30..1293053.94 rows=13819790 width=424) (actual time=2.629..6025.693 rows=1000000 loops=1)
                     ->  HashAggregate  (cost=175.02..177.02 rows=200 width=4) (actual time=2.588..5.321 rows=10000 loops=1)
                           Group Key: generate_series(0, 999999, 100)
                           Batches: 1  Memory Usage: 929kB
                           ->  ProjectSet  (cost=0.00..50.02 rows=10000 width=4) (actual time=0.002..0.674 rows=10000 loops=1)
                                 ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
                     ->  Bitmap Heap Scan on mytable p  (cost=19.28..4204.85 rows=1382 width=416) (actual time=0.007..0.020 rows=100 loops=10000)
                           Recheck Cond: (content_id = (generate_series(0, 999999, 100)))
                           Heap Blocks: exact=64444
                           ->  Bitmap Index Scan on idx_content_on_mytable  (cost=0.00..18.93 rows=1382 width=0) (actual time=0.005..0.005 rows=100 loops=10000)
                                 Index Cond: (content_id = (generate_series(0, 999999, 100)))
 Planning Time: 0.213 ms
 Execution Time: 6169.260 ms
(18 rows)

UPDATE: @jjanes commented regarding my first question:

Assuming your data is clustered on content_id, you need 100 consecutive rows out of every set of 10,000 rows. That is very different than needing every 100th row.

If I understand correctly, this means that each of the 10K index look-ups returns a contiguous range of rows rather than 100 scattered individual rows. That range can then be scanned sequentially.
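
One way to check this from the data itself is to count how many distinct heap pages hold the rows of a single content_id (the block number is the first component of ctid); a small count confirms the clustering assumption. A sketch, using an arbitrary example value:

SELECT count(DISTINCT (ctid::text::point)[0]) AS heap_pages_touched
FROM MyTable
WHERE content_id = 4200;  -- arbitrary example value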

Following are the outputs of EXPLAIN (ANALYZE, BUFFERS) for all three queries:

  1. The original query:
 Limit  (cost=1430170.64..1430171.81 rows=10 width=16) (actual time=6300.232..6300.394 rows=10 loops=1)
   Buffers: shared hit=55868 read=436879
   I/O Timings: shared/local read=1027.617
   ->  Gather Merge  (cost=1430170.64..2773605.03 rows=11514348 width=16) (actual time=6300.230..6300.391 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=55868 read=436879
         I/O Timings: shared/local read=1027.617
         ->  Sort  (cost=1429170.62..1443563.55 rows=5757174 width=16) (actual time=6291.083..6291.085 rows=8 loops=3)
               Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 25kB
               Buffers: shared hit=55868 read=436879
               I/O Timings: shared/local read=1027.617
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Seq Scan on mytable p  (cost=25.00..1304760.16 rows=5757174 width=16) (actual time=1913.156..6237.441 rows=333333 loops=3)
                     Filter: (content_id = ANY ('{0,100,...,999900}'::integer[]))
                     Rows Removed by Filter: 33000000
                     Buffers: shared hit=55754 read=436879
                     I/O Timings: shared/local read=1027.617
 Planning:
   Buffers: shared hit=149
 Planning Time: 8.444 ms
 Execution Time: 6300.452 ms
(24 rows)
  2. The query with SET enable_seqscan = off;
 Limit  (cost=1578577.14..1578578.31 rows=10 width=16) (actual time=3121.539..3123.430 rows=10 loops=1)
   Buffers: shared hit=95578
   ->  Gather Merge  (cost=1578577.14..2922011.54 rows=11514348 width=16) (actual time=3121.537..3123.426 rows=10 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=95578
         ->  Sort  (cost=1577577.12..1591970.05 rows=5757174 width=16) (actual time=3108.995..3108.997 rows=9 loops=3)
               Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
               Sort Method: top-N heapsort  Memory: 25kB
               Buffers: shared hit=95578
               Worker 0:  Sort Method: top-N heapsort  Memory: 25kB
               Worker 1:  Sort Method: top-N heapsort  Memory: 25kB
               ->  Parallel Bitmap Heap Scan on mytable p  (cost=184260.30..1453166.66 rows=5757174 width=16) (actual time=42.277..3057.887 rows=333333 loops=3)
                     Recheck Cond: (content_id = ANY ('{0,100,...,999900}'::integer[]))
                           Buffers: shared hit=40000
 Planning:
   Buffers: shared hit=149
 Planning Time: 8.591 ms
 Execution Time: 3123.638 ms
(23 rows)
  3. Like 2, but with generate_series:
 Limit  (cost=1591694.63..1591694.66 rows=10 width=16) (actual time=6155.109..6155.114 rows=10 loops=1)
   Buffers: shared hit=104447
   ->  Sort  (cost=1591694.63..1626244.11 rows=13819790 width=16) (actual time=6155.107..6155.111 rows=10 loops=1)
         Sort Key: ((((p.vector)::integer[])::vector <#> '[1,1,...,1]'::vector))
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=104447
         ->  Nested Loop  (cost=194.30..1293053.94 rows=13819790 width=16) (actual time=2.912..6034.798 rows=1000000 loops=1)
               Buffers: shared hit=104444
               ->  HashAggregate  (cost=175.02..177.02 rows=200 width=4) (actual time=2.870..5.484 rows=10000 loops=1)
                     Group Key: generate_series(0, 999999, 100)
                     Batches: 1  Memory Usage: 929kB
                     ->  ProjectSet  (cost=0.00..50.02 rows=10000 width=4) (actual time=0.002..0.736 rows=10000 loops=1)
                           ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.000..0.001 rows=1 loops=1)
               ->  Bitmap Heap Scan on mytable p  (cost=19.28..4204.85 rows=1382 width=416) (actual time=0.007..0.020 rows=100 loops=10000)
                     Recheck Cond: (content_id = (generate_series(0, 999999, 100)))
                     Heap Blocks: exact=64444
                     Buffers: shared hit=104444
                     ->  Bitmap Index Scan on idx_content_on_mytable  (cost=0.00..18.93 rows=1382 width=0) (actual time=0.005..0.005 rows=100 loops=10000)
                           Index Cond: (content_id = (generate_series(0, 999999, 100)))
                           Buffers: shared hit=40000
 Planning:
   Buffers: shared hit=180
 Planning Time: 1.012 ms
 Execution Time: 6155.251 ms
(24 rows)

Solution

  • plans 2 vs 3

    The difference between plans 2 (literal IN list with bitmap) and 3 (generate_series with bitmap) is easy to explain. PostgreSQL doesn't know how to parallelize the function scan over generate_series, so it doesn't use parallelization in that case. This is indicated by the lack of a "Gather..." node in that plan.

    My understanding of AWS Graviton instances is that their vCPU count is equal to their actual CPU count, so your ".large" machine has 2 actual processors to spread the work over. And that plan is about twice as fast, so everything there makes sense. (It tried to spread the work over 3 processes, 2 workers plus the leader, but those 3 processes had to share 2 processors.) But is this actually a good thing? If the other processor would otherwise go unused, yes. But if your real server will be busy enough that all CPUs are almost always in use, then you are just robbing Peter to pay Paul: your speed of queries run in isolation will be faster, but your total throughput when not running in isolation would not be better.
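
    If you want to confirm that parallelism alone accounts for that factor of two, you can take it out of the equation and compare the two bitmap plans single-process; a sketch using session-level settings only (these are standard GUCs, and the change is just for the experiment):

    SET enable_seqscan = off;
    SET max_parallel_workers_per_gather = 0;  -- no parallel workers for this session
    -- ...re-run the original IN-list query here (built in Python as above); its
    -- timing should now be close to the generate_series variant...
    RESET max_parallel_workers_per_gather;
    RESET enable_seqscan;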

  • plans 1 vs 2

    The reason for the difference in actual execution time between plan 1 and plan 2 is also easy to see. Most of the time is spent processing the tuples (only 1028 ms out of 6300 ms is spent in IO), and the 1st plan has to process all of the tuples, while the 2nd plan uses an index to rule most of those tuples out without actually processing them. (Another part of the difference might be just that everything was already in memory for plan 2, presumably because it was run with a "hot cache", but that can explain at most 1/3 of the difference, as the IO only took one second on the slower plan.)

    So why doesn't the planner choose the 2nd plan? This is more speculative. It looks like the planner overestimates the cost of IO, and therefore overestimates the importance of optimizing IO. The sequential plan is entirely sequential, while the bitmap scan is only kind-of sequential, so from an IO standpoint the first one should be better; the planner then overestimates the importance of that difference.
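
    If that is the culprit, the usual knobs are the planner's page-cost settings; a sketch of what to look at (defaults are seq_page_cost = 1 and random_page_cost = 4; lowering random_page_cost on SSD-backed storage is a common experiment, not a guaranteed fix):

    SHOW seq_page_cost;          -- default 1
    SHOW random_page_cost;       -- default 4
    SHOW effective_cache_size;

    -- Experiment: make non-sequential page reads look cheaper to the planner,
    -- which favors the bitmap plan. Session-level only; measure before adopting.
    SET random_page_cost = 1.1;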

    Also, it greatly overestimates the number of tuples actually returned by that part of the plan: expected 5757174 vs actual 333333. For the seq scan itself this doesn't matter, as inspecting and throwing away a tuple is about the same work as inspecting and keeping it. But for the bitmap heap scan it does: tuples can be thrown away through the use of the index without ever inspecting them, so overestimating how many tuples will be returned underestimates the utility of the bitmap. (Now that I type this up, I think this part is more important than the one given in my previous paragraph.)

    Why is the estimate so bad? First, try to ANALYZE the table and see if that fixes the problem. If not, then inspect (or show us) the contents of pg_stats for attribute "content_id" of relation "MyTable".

    I also note that plan 2 is incomplete: you show a bitmap heap scan, but no corresponding bitmap index scan. You must have removed it, as I don't think it would be possible for it not to be there. In this case I doubt seeing it would change any of my conclusions, but it is disconcerting for it to be missing.
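
    Something along these lines would do it (a sketch; the pg_stats columns are standard, the comments are what I would expect for this data set):

    ANALYZE MyTable;  -- refresh the statistics the row estimate is based on

    -- The plans estimate ~1382 rows per content_id instead of the true 100,
    -- which suggests n_distinct is being estimated far below the true 1M
    -- (100M / 1382 is roughly 72K).
    SELECT null_frac, n_distinct, correlation, most_common_vals
    FROM pg_stats
    WHERE tablename = 'mytable' AND attname = 'content_id';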

    Why do I say it overestimates the importance of IO? The seq scan accesses 55754 + 436879 = 492633 buffers. Those are actual counts, but there is no reason for the unshown estimates used by the planner for a simple sequential scan (one that doesn't use TOAST, which I am assuming is the case) to differ significantly from the actual totals. Assuming you haven't changed seq_page_cost from its default of 1.0, that contributes 492633/1304760.16 = 37.8% of the planned cost of that node, while the actual timing is much less at 1027.617/6237.441 = 16.5%.
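
    If you want to see the page count the planner actually used, it comes from pg_class (kept up to date by VACUUM/ANALYZE) and should sit close to the 492633 buffers the scan touched:

    SELECT relpages, reltuples
    FROM pg_class
    WHERE relname = 'mytable';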