Search code examples
sqlpostgresqlindexingpattern-matchingpg-trgm

How to index a column for leading wildcard search and check progress?


My table has 650M rows (according to a fast but decently precise estimate from a query I found here).

It has a text column called receiver_account_id, and I need to be able to search those records like: r.receiver_account_id LIKE '%otherWordsHere'.

Because I'm using a leading wildcard, those searches are impossibly slow. I need an index. And my guess from here is that I need a GIN index.

I ran:

CREATE EXTENSION pg_trgm;
CREATE EXTENSION btree_gin;
CREATE INDEX CONCURRENTLY receipts_receiver_account_id_gin_idx ON public.receipts USING gin (receiver_account_id);

But I'm not sure that the index is even being created.

I ran:

SELECT
  now()::TIME(0),
  a.query,
  p.phase,
  round(p.blocks_done / p.blocks_total::numeric * 100, 2) AS "% done",
  p.blocks_total,
  p.blocks_done,
  p.tuples_total,
  p.tuples_done,
  ai.schemaname,
  ai.relname,
  ai.indexrelname
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON p.pid = a.pid
LEFT JOIN pg_stat_all_indexes ai on ai.relid = p.relid AND ai.indexrelid = p.index_relid;

But I just see <insufficient privilege> (which is bizarre since I own this machine) and a bunch of NULLs.

The next 2 status queries I got from here.

SELECT * FROM pg_class, pg_index WHERE pg_index.indisvalid = false AND pg_index.indexrelid = pg_class.oid; shows:

screenshot1

And then:

SELECT a.datname,
         l.relation::regclass,
         l.transactionid,
         l.mode,
         l.GRANTED,
         a.usename,
         a.query,
         a.query_start,
         age(now(), a.query_start) AS "age",
         a.pid
FROM pg_stat_activity a
JOIN pg_locks l ON l.pid = a.pid
WHERE mode = 'ShareUpdateExclusiveLock'
ORDER BY a.query_start;

Shows: screenshot2

Am I doing this correctly? How can I know when the index creation will finish?


Solution

  • Tailored expression index

    If queries with a leading wildcard are the only (or the only important) kind of queries on that column, then consider an expression index, like @jjanes suggested. It's typically (much) smaller and cheaper to maintain than a trigram index, and faster for fitting queries. (It's far less versatile, though!)

    In modern versions of Postgres I would lean towards a COLLATE "C" index instead of text_pattern_ops, though. See:

    There is no indication in your question but, typically, you want to search case-insensitive. So I add lower() to the expression to arrive at:

    CREATE INDEX CONCURRENTLY receipts_receiver_account_id_rev_idx
    ON public.receipts (lower(reverse(receiver_account_id)) COLLATE "C");
    

    CONCURRENTLY only if you need to avoid an exclusive lock on the table. Else, it's faster without.

    Match the expression in queries:

    ... WHERE lower(reverse(receiver_account_id))
         LIKE lower(reverse('otherWordsHere'   )) || '%' COLLATE "C";
    

    Note how I concatenate the wildcard to the right explicitly. That allows Postgres to use the index even for generic query plans with parameterized 'otherWordsHere'.

    Or, faster yet, use the "starts with" operator ^@ in Postgres 15 or later:

    ... WHERE lower(reverse(receiver_account_id))
           ^@ lower(reverse('otherWordsHere'   )) COLLATE "C";
    

    No wildcard. And no hurdles for generic query plans, either. See:

    The minor downside of using COLLATE "C" instead of text_pattern_ops is that you have to spell out COLLATE "C" in queries to match the index. But while you have to match the "reverse" expression exactly anyway ...

    Related:

    If we know more about the % part and the otherWordsHere part in your pattern '%otherWordsHere', like length or constant bits, we might be able to optimize further.

    Your failed attempt at a trigram index

    You don't need the additional module btree_gin to create a trigram index on a string-type column. Just pg_trgm.
    But you forgot to declare the needed operator class:

    CREATE EXTENSION pg_trgm;
    CREATE INDEX CONCURRENTLY receipts_receiver_account_id_gin_idx ON public.receipts
    USING gin (receiver_account_id gin_trgm_ops);

    You may still want that index to cover a variety of patterns ...

    Escaping special characters in LIKE patterns

    See:

    Tracking progress

    How can I know when the index creation will finish?

    Since Postgres 12, you can consult pg_stat_progress_create_index about progress. See: