Search code examples
postgresqlindexingfull-text-searchrankingjsonb

Index for ranking JSONB search results in PostgreSQL


I'm currently optimizing my search results on jsonb fields of PostgreSQL. I'm using Postgres 9.6. My ultimate goal is to search on multiple fields within my jsonb document and rank the results according to their total hits in all fields. But I'm stuck because the ts_rank function doesn't use my index and slowing down the search immensely. Here is a minimal example:

CREATE TABLE book (
  id   BIGSERIAL NOT NULL,
  data JSONB     NOT NULL
);

CREATE INDEX book_title_idx
  ON book USING GIN (to_tsvector('english', book.data ->> 'title'));

INSERT INTO book (data)
VALUES (CAST('{"title": "Cats"}' AS JSONB));

When trying to search on the title field I'm using this query:

EXPLAIN ANALYZE
SELECT *
FROM (
       SELECT
         id,
         data ->> 'title'              AS title,
         ts_rank(title_query, 'cat:*') AS score
       FROM
         book,
         to_tsvector('english', data ->> 'title') title_query
       WHERE title_query @@ to_tsquery('cat:*')
       ORDER BY score DESC) a
WHERE score > 0
ORDER BY score DESC;

Without ranking the search on my real data takes < 1ms, with the ranking it's ~1800ms. It gets worse the more fields I search on. I need the ranking only to make hits in multiple fields more valueable.


Solution

  • Your query gives the plan (on a test dataset with 500000 rows):

                                                                     QUERY PLAN                                                                  
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Sort  (cost=216058.57..217308.57 rows=500001 width=63) (actual time=831.033..831.033 rows=1 loops=1)
       Sort Key: (ts_rank(title_query.title_query, '''cat'':*'::tsquery)) DESC
       Sort Method: quicksort  Memory: 25kB
       ->  Nested Loop  (cost=0.25..149927.55 rows=500001 width=63) (actual time=4.410..830.950 rows=1 loops=1)
             ->  Seq Scan on book  (cost=0.00..8677.01 rows=500001 width=31) (actual time=0.024..30.159 rows=500001 loops=1)
             ->  Function Scan on to_tsvector title_query  (cost=0.25..0.52 rows=1 width=32) (actual time=0.001..0.001 rows=0 loops=500001)
                   Filter: ((ts_rank(title_query, '''cat'':*'::tsquery) > '0'::double precision) AND (title_query @@ to_tsquery('cat:*'::text)))
                   Rows Removed by Filter: 1
     Planning time: 37.211 ms
     Execution time: 831.279 ms
    (10 rows)
    

    Replace the alias title_query in WHERE clause with the expression used in the index definition:

    EXPLAIN ANALYZE
    SELECT *
    FROM (
           SELECT
             id,
             data ->> 'title'              AS title,
             ts_rank(title_query, 'cat:*') AS score
           FROM
             book,
             to_tsvector('english', data ->> 'title') title_query
           WHERE to_tsvector('english', data ->> 'title') @@ to_tsquery('cat:*')
           ORDER BY score DESC) a
    WHERE score > 0
    ORDER BY score DESC;
    
     Sort  (cost=9905.39..9930.39 rows=10000 width=63) (actual time=1.069..1.069 rows=1 loops=1)
       Sort Key: (ts_rank(title_query.title_query, '''cat'':*'::tsquery)) DESC
       Sort Method: quicksort  Memory: 25kB
       ->  Nested Loop  (cost=114.00..9241.00 rows=10000 width=63) (actual time=1.049..1.050 rows=1 loops=1)
             ->  Bitmap Heap Scan on book  (cost=113.75..8940.75 rows=10000 width=31) (actual time=0.052..0.052 rows=1 loops=1)
                   Recheck Cond: (to_tsvector('english'::regconfig, (data ->> 'title'::text)) @@ to_tsquery('cat:*'::text))
                   Heap Blocks: exact=1
                   ->  Bitmap Index Scan on book_title_idx  (cost=0.00..111.25 rows=10000 width=0) (actual time=0.047..0.047 rows=1 loops=1)
                         Index Cond: (to_tsvector('english'::regconfig, (data ->> 'title'::text)) @@ to_tsquery('cat:*'::text))
             ->  Function Scan on to_tsvector title_query  (cost=0.25..0.27 rows=1 width=32) (actual time=0.994..0.994 rows=1 loops=1)
                   Filter: (ts_rank(title_query, '''cat'':*'::tsquery) > '0'::double precision)
     Planning time: 0.639 ms
     Execution time: 1.120 ms
    (13 rows)