Tags: json, postgresql, jsonb, postgresql-performance

Query on json / jsonb column super slow. Can I use an index?


I am trying to speed up the querying of some JSON data stored in a PostgreSQL database. I inherited an application that queries a PostgreSQL table called data with a field called value, where value is a blob of JSON of type jsonb.

The table has about 300 rows, but it takes 12 seconds to select these 5 JSON elements. The JSON blobs are a bit large, but the data I need is all in the top level of the JSON nesting, if that helps.

I tried adding an index with CREATE INDEX idx_tbl_data ON data USING gin (value); but that didn't help. Is there a different index I should be using? The long-term vision is to rewrite the application to move that data out of the JSON, but that is at least 30-40 man-days of work due to complexity in other parts of the application, so I am looking to see if I can make this faster in the short term.

Not sure if it helps, but the underlying data that makes up this result set doesn't change often. It's the data further down in the JSON blob that changes often.

SELECT
  value::json ->> 'name' AS name,
  value::json ->> 'mnemonic' AS mnemonic,
  value::json ->> 'urlName' AS "urlName",
  value::json ->> 'countryCode' AS "countryCode",
  value::json #>>'{team}' AS team
FROM
  data;

Solution

  • Like you mentioned yourself, the proper fix is to extract those attributes to separate columns, somewhat normalizing your design.

    Can an index help?

    Sadly, no (as of Postgres 14).

    It could work in theory. Since your values are big, an expression index with just some small attributes can be picked up by Postgres in an index-only scan, even when retrieving all rows (where it otherwise would ignore indexes).

    The manual:

    However, PostgreSQL's planner is currently not very smart about such cases. It considers a query to be potentially executable by index-only scan only when all columns needed by the query are available from the index.

    So you would have to include value itself in the index, even if just as INCLUDE column - totally spoiling the whole idea. No go.
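
    For illustration only, this is roughly what such an expression index would look like (a sketch; the index name is made up, and as explained above, the planner still won't use it for an index-only scan unless value itself is included):

    CREATE INDEX idx_data_small_attrs ON data (
      (value ->> 'name')
    , (value ->> 'mnemonic')
    , (value ->> 'urlName')
    , (value ->> 'countryCode')
    , (value #>> '{team}')
    );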

    You probably can still do something in the short term. Two crucial quotes:

    I am looking to see if I can make this faster in short term

    The json blobs are a bit large

    Data type

    Drop the cast to json from the query. Casting every time adds pointless cost.
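
    Applied to your query, that means reading from the jsonb column directly (the same SELECT, minus the casts):

    SELECT
      value ->> 'name'        AS name,
      value ->> 'mnemonic'    AS mnemonic,
      value ->> 'urlName'     AS "urlName",
      value ->> 'countryCode' AS "countryCode",
      value #>> '{team}'      AS team
    FROM data;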

    Compression

    One major cost factor will be compression. Postgres has to "de-toast" the whole large column, just to extract some small attributes. Since Postgres 14, you can switch the compression algorithm (if support is enabled in your version!). The default is defined by the config setting default_toast_compression, which is pglz by default. Currently the only available alternative is lz4. You can set that per column. Any time.

    LZ4 (lz4) is considerably faster, while typically compressing a bit less: about twice as fast, but around 10 % more storage (it depends!). If performance is not an issue, it's best to stick with the stronger compression of the default LZ algorithm (pglz). There may be more compression algorithms to pick from in the future.
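
    If you want lz4 for new values everywhere (not just this one column), you can also inspect and change the default setting itself; a quick sketch, assuming lz4 support is compiled in and you have the required privileges:

    SHOW default_toast_compression;                      -- 'pglz' unless changed

    SET default_toast_compression = 'lz4';               -- current session only
    ALTER SYSTEM SET default_toast_compression = 'lz4';  -- instance-wide; needs superuser
    SELECT pg_reload_conf();                             -- make the instance-wide change take effect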

    To implement:

    ALTER TABLE data
      ALTER COLUMN value SET COMPRESSION lz4;
    

    Setting a new COMPRESSION for a column does not re-compress automatically. Postgres remembers the compression method and only re-compresses if it's forced to un-compress in any way. You may want to force re-compression of existing values. You can check with:

    SELECT pg_column_compression(value) FROM data LIMIT 10;
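
    If existing rows still report pglz, one way to force a rewrite is a no-op UPDATE that de-toasts the values; a sketch, assuming the stored values are JSON objects (concatenating an empty object leaves an object unchanged, but forces it to be stored again under the new compression setting):

    UPDATE data
    SET    value = value || '{}'::jsonb;  -- no-op for JSON objects, but rewrites each datum

    VACUUM FULL data;  -- optional: reclaim the bloat left behind by the update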
    


    GENERATED columns

    While stuck with the broken design, you might just add some (small!) generated columns to cover your query:

    ALTER TABLE data
      ADD COLUMN name text GENERATED ALWAYS AS (value ->> 'name') STORED
    , ADD COLUMN mnemonic text GENERATED ALWAYS AS (value ->> 'mnemonic') STORED
    ...
    

    And then target only those generated columns, not involving the big value at all.

    SELECT name, mnemonic, ... FROM data;
    

    That would bypass the main performance issue.


    However, you mentioned:

    It's the data that is further down in the json blob that often changes.

    Every change to value forces a re-check on the generated columns, so that adds write cost.