Search code examples
postgresqlindexingdatabase-designjsonbpostgresql-performance

Indexing a JSONB key in combination with other columns


For searching on specific keys within a jsonb column, I wanted to create an index on the column.

Using: Postgres 10.2

Ignoring some unrelated columns, I have the table animals with these columns (omitting some unrelated columns):

animalid PK number
location (text)
type (text)
name (text)
data (jsonb) for eg: {"age": 2, "tagid": 11 }

I would need to search based on: location, type and tagId. Like:

where location = ? and type = 'cat' and (data ->> 'tagid') = ?

Other important points:

  • Only animals of type cats will be having a tag id, this is a new animal type being added now.
  • The number of "cats" in the whole table would be less compared to the other types of animals.
  • The table is huge, with millions of rows - and partitioned.

How do I make sure that the search is fast? Options I have considered:

  1. Make a separate table cats to store: animal_id, location, tagId (although FK to the partitioned parent table is not possible)
  2. Create an index on location, type and jsonb key.
  3. Create a new (indexed) column tagId - which would be null for all animals but cat.

I do have an index on other columns on the table - but a bit confused on how to create an index to make the search for cats based on the tagid fast. Any suggestions?

UPDATE (Ignoring partitions):

(Testing on a partitioned table)

So I decided to go with the option as suggested by Erwin and tried creating an index

CREATE INDEX ON animals_211 (location, ((data->>'tagid')::uuid)) WHERE  type = 'cat';

And tried an EXPLAIN on the query (using the partitioned table to keep it simple):

explain select * from animals_211 a
where a.location  = 32341
and a.type  = 'cat'
and (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c'

And from the results, it looks like it doesn't use the index created and does a sequential scan:

Seq Scan on animals_211  e  (cost=0.00..121.70 rows=1 width=327)                                                                                                        |
  Filter: ((location = 32341) AND ((type)::text = 'cat'::text) AND (((data ->> 'tagid'::text))::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c'::uuid

UPDATE 2 (not using partial index)

It somehow seems to be the partial index, as without it - it seems to work:

CREATE INDEX tag_id_index ON animals_211 (location, type, ((data->>'tagid')::uuid))

When I do an explain plan:

Index Scan using tag_id_index on animals_211 e  (cost=0.28..8.30 rows=1 width=327)                                                                                         
  Index Cond: ((location = 32341) AND ((type)::text = 'cat'::text) AND (((data ->> 'tagid'::text))::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c'::uuid))

Solution

  • Basics (ignoring partitioning)

    Based on your three "important points" I suggest a partial index on an expression:

    CREATE INDEX ON animals ((data->>'tagid'))
    WHERE  type = 'cat';
    

    Use CREATE INDEX CONCURRENTLY ... to avoid locking issues with concurrent write access to the same table.

    Postgres also gathers specific statistics for the partial index, which help the query planner to get appropriate estimates. Note that you need to run ANALYZE (or VACUUM ANALYZE) manually if you test the index right after creation before autovacuum can kick in. See:

    If tagid is really some other data type than text, you can also cast the expression to optimize some more. See:

    Your update suggests tagid stores UUID values. Read:

    So consider this index instead:

    CREATE INDEX ON animals (((data->>'tagid')::uuid))  -- !
    WHERE  type = 'cat';
    

    The extra set of parentheses around (data->>'tagid')::uuid is required to make the syntax unambiguous.
    And a matching query:

    SELECT *
    FROM   animals
    WHERE  location = 32341
    AND    type = 'cats'
    AND    (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c';  -- !
    

    Or - depending on the selectivity of each predicate and which variants of the query are possible - include location to make it a multicolumn index:

    CREATE INDEX ON animals (location, ((data->>'tagid')::uuid))
    WHERE  type = 'cat';
    

    Or tagid first if you have queries without filtering on location. See:

    Since only relatively few rows are of type 'cat', the index will be relatively small, excluding the bulk of your "millions of rows". And we only ever need the index on tagid for cats to begin with. Win-win.

    If possible break out the json key data->>'tagid' as dedicated column. (Like you considered as option 3.) Can be null where it's not applicable, null storage is very cheap. Makes storage and index cheaper, yet, and queries a bit simpler.

    Partitioning

    Postgres 10 does not support indexes on the parent table of a partitioned table. This was added in Postgres 11. Declarative partitioning has been improved a lot since. Consider upgrading to the current version 13 or later.

    There is also the option of "old-style" partitioning with inheritance. Then you can have a separate partition for cats with an additional column tagid only there. The manual:

    For declarative partitioning, partitions must have exactly the same set of columns as the partitioned table, whereas with table inheritance, child tables may have extra columns not present in the parent.

    Sounds like a perfect fit. But inheritance has fallen out of favor with Postgres, so I would think twice before doing that.

    Either way - be it declarative or inheritance - if you have all "cats" in a separate partition a non-partial index does the job, obviously:

    CREATE INDEX ON cats (location, ((data->>'tagid')::uuid));
    

    And the query can target the partition cats instead of the parent table:

    SELECT *
    FROM   cats
    WHERE  location = 32341
    AND    (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c';
    

    Targeting the parent table should work, too. (Not sure about Postgres 10.)

    SELECT *
    FROM   animals
    WHERE  type = 'cat'
    AND    location = 32341
    AND    (data->>'tagid')::uuid = '5e54c1d9-3ea0-4bca-81d6-1000d90cc42c';
    

    But activate partition pruning for that. The manual:

    Note that partition pruning is driven only by the constraints defined implicitly by the partition keys, not by the presence of indexes. Therefore it isn't necessary to define indexes on the key columns.

    All other partitions should be pruned, and then you should get an index scan on the cats partition only ...