
Index for searching nested JSONB array elements in PostgreSQL


I'm fairly new to PostgreSQL and currently using 9.6. While implementing a full-text search over its JSONB documents I noticed slow search results for nested arrays. Running EXPLAIN showed that no index was being used. To investigate, I created a simplified table:

CREATE TABLE book (
  id   BIGSERIAL NOT NULL,
  data JSONB     NOT NULL
);

My available indices:

CREATE INDEX book_author_idx
  ON book USING GIN (to_tsvector('english', book.data ->> 'author'));
CREATE INDEX book_author_name_idx
  ON book USING GIN (to_tsvector('english', book.data -> 'author' ->> 'name'));

And some data to fill a document:

INSERT INTO book (data)
VALUES ('{"author": [{"id": 0, "name": "Cats"},
                     {"id": 1, "name": "Dogs"}]}' :: JSONB);

I'm able to search for book elements using the following query, but it doesn't use any index. With my actual data of 120k products it takes around 1200 ms, while other, indexed searches take 0.2 ms.

EXPLAIN ANALYZE
SELECT
  id,
  data ->> 'author' AS author
FROM book, jsonb_array_elements(data #> '{author}') author_array
WHERE to_tsvector('english', author_array ->> 'name') @@ to_tsquery('cat');

In contrast, the next query uses book_author_name_idx, but because of the array structure it doesn't find anything.

EXPLAIN ANALYZE
SELECT
  id,
  data ->> 'author' AS author
FROM book
WHERE to_tsvector('english', data -> 'author' ->> 'name') @@ to_tsquery('cat');
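
As far as I can tell, the reason is that data -> 'author' is a JSON array, so extracting the text key 'name' from it yields NULL and the indexed tsvector is empty. A quick check against the sample row above (just an illustration):

SELECT data -> 'author' ->> 'name' AS name
FROM book;
-- name is NULL: the ->> operator with a text key matches object
-- fields, not elements of an array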

How can I tweak my query so that it uses a full-text index? I'm aware that I could move authors into a separate table and reference them by id, but I'd rather keep all data in one table for performance.


Solution

  • With the tips from posz's comments I found a solution. Because the built-in || operator doesn't behave the way I need (it is strict, so a NULL operand makes the whole result NULL), I used a custom concatenation function for tsvectors. I took the code from glittershark on GitHub and changed the to_tsvector configuration from 'default' to 'english' to fit my needs.

    CREATE OR REPLACE FUNCTION concat_tsvectors(tsv1 TSVECTOR, tsv2 TSVECTOR)
      RETURNS TSVECTOR AS $$
    BEGIN
      -- NULL-safe concatenation: treat a NULL operand as the empty tsvector.
      RETURN coalesce(tsv1, to_tsvector('english', ''))
             || coalesce(tsv2, to_tsvector('english', ''));
    END;
    $$ LANGUAGE plpgsql;
    
    CREATE AGGREGATE tsvector_agg (
      BASETYPE = TSVECTOR,
      SFUNC = concat_tsvectors,
      STYPE = TSVECTOR,
      INITCOND = ''
    );
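
    A quick sanity check of the aggregate (just an illustration, not part of the final solution):

    SELECT tsvector_agg(v)
    FROM unnest(ARRAY [to_tsvector('english', 'Cats'),
                       to_tsvector('english', 'Dogs')]) AS t(v);
    -- should return the combined vector, e.g. 'cat':1 'dog':2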
    

    Here is the custom function I wrote. Its input is the JSONB data and its output is a tsvector of the aggregated author names.

    CREATE OR REPLACE FUNCTION author_function(
      IN  data        JSONB,
      OUT resultNames TSVECTOR
    )
      RETURNS TSVECTOR AS $$
    DECLARE
      authorRecords   RECORD;
      combinedAuthors JSONB [];
      singleAuthor    JSONB;
    BEGIN
      -- Collect every element of the 'author' array.
      FOR authorRecords IN (SELECT value
                            FROM jsonb_array_elements(data #> '{author}'))
      LOOP
        combinedAuthors := combinedAuthors || authorRecords.value;
      END LOOP;
      -- Fold each author's name into one aggregated tsvector.
      FOREACH singleAuthor IN ARRAY coalesce(combinedAuthors, '{}')
      LOOP
        resultNames := concat_tsvectors(resultNames,
                                        to_tsvector('english', singleAuthor ->> 'name'));
      END LOOP;
    END; $$
    LANGUAGE plpgsql
    IMMUTABLE;
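
    To verify the function against the sample document (the expected vector is an assumption based on the 'english' configuration):

    SELECT author_function(
      '{"author": [{"id": 0, "name": "Cats"}, {"id": 1, "name": "Dogs"}]}' :: JSONB);
    -- returns a combined vector such as 'cat':1 'dog':2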
    

    Then I set up an index on my book objects.

    CREATE INDEX book_author_function_idx
      ON book USING GIN (author_function(book.data));
    

    The author names have already gone through to_tsvector('english', singleAuthor ->> 'name'), so I can query them like this:

    EXPLAIN ANALYSE
    SELECT
      id,
      data ->> 'author' AS author
    FROM book
    WHERE author_function(book.data) @@ to_tsquery('cat');
    

    As a result, queries against my actual data went from 1100-1200 ms to ~0.5 ms. I'm not sure whether this is the best solution, so if you have better suggestions, please let me know.