Tags: sql, performance, postgresql, indexing, postgresql-performance

Optimizing a row exclusion query


I am designing a mostly read-only database containing 300,000 documents with around 50,000 distinct tags, with each document having 15 tags on average. For now, the only query I care about is selecting all documents with no tag from a given set of tags. I'm only interested in the document_id column (no other columns in the result).

My schema is essentially:

CREATE TABLE documents (
    document_id  SERIAL  PRIMARY KEY,
    title        TEXT
);

CREATE TABLE tags (
    tag_id  SERIAL  PRIMARY KEY,
    name    TEXT    UNIQUE
);

CREATE TABLE documents_tags (
    document_id    INTEGER  REFERENCES documents,
    tag_id         INTEGER  REFERENCES tags,

    PRIMARY KEY (document_id, tag_id)
);

I can write this query in Python by pre-computing the set of documents for a given tag, which reduces the problem to a few fast set operations:

In [17]: %timeit all_docs - (tags_to_docs[12345] | tags_to_docs[7654])
100 loops, best of 3: 13.7 ms per loop

Translating the set operations to Postgres doesn't work that fast, however:

stuff=# SELECT document_id AS id FROM documents WHERE document_id NOT IN (
stuff(#     SELECT documents_tags.document_id AS id FROM documents_tags
stuff(#     WHERE documents_tags.tag_id IN (12345, 7654)
stuff(# );
   document_id
---------------
    ...
Time: 201.476 ms
  • Replacing NOT IN with EXCEPT makes it even slower.
  • I have btree indexes on document_id and tag_id in all three tables and another one on (document_id, tag_id).
  • The default memory limits on Postgres' process have been increased significantly, so I don't think Postgres is misconfigured.

How do I speed up this query? Is there any way to pre-compute the mapping between tags and documents like I did in Python, or am I thinking about this the wrong way?


Here is the result of an EXPLAIN ANALYZE:

EXPLAIN ANALYZE
SELECT document_id AS id FROM documents
WHERE document_id NOT IN (
    SELECT documents_tags.document_id AS id FROM documents_tags
    WHERE documents_tags.tag_id IN (12345, 7654)
);
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on documents  (cost=20280.27..38267.57 rows=83212 width=4) (actual time=176.760..300.214 rows=20036 loops=1)
   Filter: (NOT (hashed SubPlan 1))
   Rows Removed by Filter: 146388
   SubPlan 1
     ->  Bitmap Heap Scan on documents_tags  (cost=5344.61..19661.00 rows=247711 width=4) (actual time=32.964..89.514 rows=235093 loops=1)
           Recheck Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
           Heap Blocks: exact=3300
           ->  Bitmap Index Scan on documents_tags__tag_id_index  (cost=0.00..5282.68 rows=247711 width=0) (actual time=32.320..32.320 rows=243230 loops=1)
                 Index Cond: (tag_id = ANY ('{12345,7654}'::integer[]))
 Planning time: 0.117 ms
 Execution time: 303.289 ms
(11 rows)

Time: 303.790 ms

The only settings I changed from the default configuration were:

shared_buffers = 5GB
temp_buffers = 128MB
work_mem = 512MB
effective_cache_size = 16GB

Running Postgres 9.4.5 on a server with 64GB RAM.


Solution

  • Optimize setup for read performance

    Your memory settings seem reasonable for a 64GB server - except maybe work_mem = 512MB. That's high. Your queries are not particularly complex and your tables are not that big.

    4.5 million rows (300k x 15) in the simple junction table documents_tags should occupy ~ 156 MB and the PK another ~ 96 MB. For your query you typically don't need to read the whole table, just small parts of the index. For a "mostly read-only" workload like yours, you should see index-only scans on the index of the PK exclusively. You don't need nearly as much work_mem - which may not matter much - except where you have many concurrent queries. Quoting the manual:

    ... several running sessions could be doing such operations concurrently. Therefore, the total memory used could be many times the value of work_mem; it is necessary to keep this fact in mind when choosing the value.

    Setting work_mem too high may actually impair performance.

    I suggest reducing work_mem to 128 MB or less to avoid possible memory starvation - unless you have other common queries that require more. You can always set it higher locally for special queries, as sketched below.
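    A minimal sketch of such a local override, assuming an occasional query actually needs the larger value:

    BEGIN;
    SET LOCAL work_mem = '512MB';  -- applies only to this transaction
    -- ... run the exceptional query here ...
    COMMIT;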

    There are several other angles from which to optimize read performance.

    Key problem: leading index column

    All of this may help a little. But the key problem is this:

    PRIMARY KEY (document_id, tag_id)

    300k documents, 2 tags to exclude. Ideally, you have an index with tag_id as the leading column and document_id as the second. With an index on just (tag_id) you can't get index-only scans. If this query is your only use case, change your PK as demonstrated below.

    Or, probably even better: create an additional plain index on (tag_id, document_id) if you need both column orders - and drop the other two indexes on documents_tags on just (tag_id) and (document_id). They offer nothing over the two multicolumn indexes. The remaining two indexes (as opposed to three before) are smaller and superior in every way. A sketch of this variant follows below.
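    A minimal sketch of the two-index variant. The name of the index on just (document_id) is a guess - check the actual names with \d documents_tags:

    CREATE INDEX documents_tags_tag_id_document_id_idx
        ON documents_tags (tag_id, document_id);

    DROP INDEX documents_tags__tag_id_index;       -- the index named in your query plan
    DROP INDEX documents_tags__document_id_index;  -- hypothetical name; covered by the PK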

    While we're at it, I suggest also CLUSTERing the table using the new PK, all in one transaction, possibly with some extra maintenance_work_mem set locally:

    BEGIN;
    SET LOCAL maintenance_work_mem = '256MB';
    
    ALTER TABLE documents_tags 
      DROP CONSTRAINT documents_tags_pkey
    , ADD  PRIMARY KEY (tag_id, document_id);  -- tag_id first.
    
    CLUSTER documents_tags USING documents_tags_pkey;
    
    COMMIT;

    Don't forget to:

    VACUUM ANALYZE documents_tags;
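
    To verify the size estimates from above, a quick check with the standard admin functions:

    SELECT pg_size_pretty(pg_relation_size('documents_tags'))      AS table_size  -- expect ~ 156 MB
         , pg_size_pretty(pg_relation_size('documents_tags_pkey')) AS pk_size;    -- expect ~ 96 MB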
    

    Queries

    The query itself is run-of-the-mill. Here are the 4 standard techniques:

    NOT IN is - to quote myself:

    Only good for small sets without NULL values

    Your use case exactly: all involved columns NOT NULL and your list of excluded items is very short. Your original query is a hot contender.

    NOT EXISTS and LEFT JOIN / IS NULL are always hot contenders. Both have been suggested in other answers. LEFT JOIN has to be an actual LEFT [OUTER] JOIN, though.

    EXCEPT ALL would be shortest, but often not as fast.

    1. NOT IN

    SELECT document_id
    FROM   documents d
    WHERE  document_id NOT IN (
       SELECT document_id  -- no need for column alias, only value is relevant
       FROM   documents_tags
       WHERE  tag_id IN (12345, 7654)
       );
    

    2. NOT EXISTS

    SELECT document_id
    FROM   documents d
    WHERE  NOT EXISTS (
       SELECT FROM documents_tags
       WHERE  document_id = d.document_id
       AND    tag_id IN (12345, 7654)
       );
    

    3. LEFT JOIN / IS NULL

    SELECT d.document_id
    FROM   documents d
    LEFT   JOIN documents_tags dt ON dt.document_id = d.document_id
                                 AND dt.tag_id IN (12345, 7654)
    WHERE  dt.document_id IS NULL;

    4. EXCEPT ALL

    SELECT document_id
    FROM   documents
    EXCEPT ALL               -- ALL keeps duplicate rows and makes it faster
    SELECT document_id
    FROM   documents_tags
    WHERE  tag_id IN (12345, 7654);
    

    Benchmark

    To put my theories to the test, I benchmarked all four queries.

    Test setup

    SET random_page_cost = 1.1;
    SET work_mem = '128MB';
    
    CREATE TABLE documents (
      document_id serial PRIMARY KEY
    , title       text
    );
    
    -- CREATE TABLE tags ( ...  -- irrelevant for this query    
    
    CREATE TABLE documents_tags (
      document_id int  REFERENCES documents
    , tag_id      int  -- REFERENCES tags  -- irrelevant for test
     -- no PK yet, to test seq scan
     -- it's also faster to create the PK after filling the big table
    );
    
    INSERT INTO documents (title)
    SELECT 'some dummy title ' || g
    FROM   generate_series(1, 300000) g;
    
    INSERT INTO documents_tags(document_id, tag_id)
    SELECT i.*
    FROM   documents d
    CROSS  JOIN LATERAL (
       SELECT DISTINCT d.document_id, ceil(random() * 50000)::int
       FROM   generate_series (1,30)
       WHERE  random() > .5
       ) i;
    
    ALTER TABLE documents_tags ADD PRIMARY KEY (document_id, tag_id);  -- current idx
        
    VACUUM ANALYZE documents_tags;
    VACUUM ANALYZE documents;
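
    A quick sanity check of the generated data - the expected numbers follow from the question:

    SELECT count(*)                        AS total_rows     -- expect ~ 4.5 million
         , count(DISTINCT tag_id)          AS distinct_tags  -- expect close to 50,000
         , round(count(*) / count(DISTINCT document_id)::numeric, 1) AS avg_tags_per_doc  -- expect ~ 15
    FROM   documents_tags;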
    

    Note that rows in documents_tags are physically clustered by document_id due to the way I filled the table - which is likely your current situation as well.
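
    You can verify the physical order with the correlation statistic in pg_stats (populated by the VACUUM ANALYZE above) - a value close to 1 for document_id means rows are physically sorted by that column:

    SELECT attname, correlation
    FROM   pg_stats
    WHERE  tablename = 'documents_tags';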

    Original Test with Postgres 9.5 on 2016-08-06

    On my old laptop with 4 GB RAM.

    3 test runs with each of the 4 queries, best of 5 every time to exclude caching effects.

    Test 1: With documents_tags_pkey like you have it. Index and physical order of rows are bad for our query.
    Test 2: Recreate the PK on (tag_id, document_id) as suggested.
    Test 3: CLUSTER on the new PK.

    Execution time of EXPLAIN ANALYZE in ms:

      time in ms   | Test 1 | Test 2 | Test 3
    ---------------+--------+--------+-------
    1. NOT IN      | 654    |  70    |  71  -- winner!
    2. NOT EXISTS  | 684    | 103    |  97
    3. LEFT JOIN   | 685    |  98    |  99
    4. EXCEPT ALL  | 833    | 255    | 250

    Conclusions

    • Key element is the right index with leading tag_id - for queries involving few tag_id and many document_id.
      To be precise, it's not important that there are more distinct document_id than tag_id. This could be the other way round as well. Btree indexes basically perform the same with any order of columns. It's the fact that the most selective predicate in your query filters on tag_id. And that's faster on the leading index column(s).

    • The winning query for few tag_id to exclude is your original with NOT IN.

    • NOT EXISTS and LEFT JOIN / IS NULL result in the same query plan. For more than a couple of dozen IDs, I expect these to scale better than NOT IN.

    • In a read-only situation you'll see index-only scans exclusively, so the physical order of rows in the table becomes irrelevant. Hence, test 3 did not bring any more improvements.

    • If writes to the table happen and autovacuum can't keep up, you'll see (bitmap) index scans. Physical clustering is important for those.
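
    To see whether autovacuum keeps up, a quick diagnostic against the cumulative statistics views:

    SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
    FROM   pg_stat_user_tables
    WHERE  relname = 'documents_tags';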

    Retest with Postgres 15 on 2023-04-11

    With plenty of work_mem on a newer laptop with 64 GB RAM.

    I skipped "Test 2", and added "Test 1+" instead. That's the same as Test 1, but with increased max_parallel_workers_per_gather so that 3 parallel workers were used instead of just 2.

      time in ms   | Test 1 | Test 1+ | Test 3
    ---------------+--------+---------+-------
    1. NOT IN      | 170    | 154     |  42
    2. NOT EXISTS  | 153    | 132     |  35  -- winner!
    3. LEFT JOIN   | 154    | 135     |  35  -- same query plan
    4. EXCEPT ALL  | 276    | 254     | 137

    As expected, performance generally improved.

    NOT IN could not keep up, since it poses some inherent optimization barriers. The clear winner is now NOT EXISTS or LEFT JOIN / IS NULL, using index-only scans throughout.

    "Test 1" and "Test 1+" look better than they were. Those used 2 and 3 workers, so multiply the numbers by that factor go get actual system load.