postgresql indexing full-text-search database-partitioning

Postgresql partition pruning and btree_gin indexes not playing ball

The setup can be roughly summarized as follows:

CREATE TYPE bird_species AS ENUM('sparrow', 'chicken', 'hawk');

CREATE TABLE bird (
  id INTEGER GENERATED ALWAYS AS IDENTITY,
  species bird_species NOT NULL,
  description TEXT,
  PRIMARY KEY(species, id)
) PARTITION BY LIST(species);

CREATE TABLE bird_sparrow PARTITION OF bird
FOR VALUES IN ('sparrow');

CREATE TABLE bird_chicken PARTITION OF bird
FOR VALUES IN ('chicken');

CREATE TABLE bird_hawk PARTITION OF bird
FOR VALUES IN ('hawk');

description is not null for around 17% of entries. The table is filled in with a couple million entries.

I want the query

SELECT * FROM bird
WHERE
  species = 'sparrow'
  AND description IS NOT NULL
  AND to_tsvector('english', description) @@ plainto_tsquery('english', $1)
ORDER BY id DESC
LIMIT 10

to run at a reasonable speed. I've attempted creating a btree_gin index:

CREATE INDEX bird_species_description_index ON bird USING GIN(species, to_tsvector(description)) WHERE description IS NOT NULL;

But the query seems to still fall on an inverted pk index scan followed by a filter. However, the queries

SELECT * FROM bird
WHERE
  species = 'sparrow'
  AND description IS NOT NULL
  AND to_tsvector('english', description) @@ plainto_tsquery('english', $1)
ORDER BY id DESC

(does exactly the bitmap index scan on the partial index on the partitioned table you'd expect and is my current hack solution) and

SELECT * FROM bird
WHERE
  description IS NOT NULL
  AND to_tsvector('english', description) @@ plainto_tsquery('english', $1)
ORDER BY id DESC
LIMIT 10

(does the correct bitmap index scan on each partition and then joins everything) are both very performant.

I think postgres is falling onto a very queer index plan because it doesn't know what values are actually common in the GIN index, but I don't know how to solve that.

Postgres version: 14.3.

Configuration: default except for 500mb shared buffers, random page cost set to 1.1 and work_mem increased to 2MB.

Solution

This probably doesn't have anything to do with either partitions or specifically with btree_gin, but is more general problem with the planner. The planner does not use stats collected on partial expressional indexes to get row estimates.

You have already found that storing the tsvector materialized into the table is one way to fix the problem--no longer an expressional index. Another way would to drop the WHERE clause from the index definition, so it is no longer partial. That will make the index somewhat larger as it needs to store ctids for the NULL descriptions, but those compress well, and you never actually need to visit that part of the index with queries of the type you show here.

Also, note that there is no reason to include species into the GIN indexes. The indexes are already partitioned by that field by virtue of the table being partitioned. (If some of your partitions do cover multiple species, then it might make sense to include it--but even that is dubious.)

but I imagine that will only last until the database grows again.

There are all kinds of problems that might possibly show up when the database grows. I see no reason to think this problem is more likely to reappear than some other problem is to appear (or be recognized) for the first time. You could set up a test database and stuff it full of plausible but fictional data, and see what happens.