I have a character varying column called message in a table messages that stores user messages from an IRC channel. With a chat bot, I allow users to search how many times a certain term has been typed. This term can be anything: one character, one word, or multiple words. The table has ~15 million rows and query times can be pretty long.

I use the following query to find all substrings matching term, case-insensitively:
select sum(array_length(string_to_array(LOWER(message), LOWER('term')), 1) -1) from messages;
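For reference, the split-and-count trick that expression relies on can be mirrored in Python (an illustrative sketch, not part of the actual setup): splitting the lowercased message by the lowercased term yields one more piece than there are occurrences.

```python
def count_occurrences(message: str, term: str) -> int:
    # Mirrors array_length(string_to_array(LOWER(message), LOWER(term)), 1) - 1:
    # splitting by the term produces one more piece than there are matches.
    return len(message.lower().split(term.lower())) - 1

print(count_occurrences("Hello hello world", "hello"))  # 2
```

Like the SQL expression, this counts non-overlapping matches only.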
The query uses a sequential scan. When I set enable_seqscan = off; it uses the btree index that I also have on the table. The table also has a trigram index, but it never gets used.
What angle would you be looking at to improve query performance?
I use Postgres 14.9
Can you use PostgreSQL optimization for this application? Yes, but not with the query organized the way you have it.
Start by counting the messages rows where the message column contains an arbitrary user-furnished search term. You can do so with this query:
SELECT COUNT(*) FROM messages WHERE message ILIKE '%term%';
Then create a so-called trigram index on the column you want to search. You'd do that like so:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY message_text ON messages
USING GIN (message gin_trgm_ops);
This index setup, a PostgreSQL-specific feature, accelerates the LIKE and ILIKE predicates.
This technique returns the number of messages containing one or more occurrences of the search term, whereas your requirement calls for the total number of occurrences. To get that total, use a subquery to filter down to just the messages containing the term, then count the occurrences within those. That is faster than scanning every message:
select sum(array_length(string_to_array(LOWER(message), LOWER('term')), 1) - 1)
from (
  SELECT message FROM messages WHERE message ILIKE '%term%'
) subset;
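The two-phase shape of that query (a cheap, index-assisted filter followed by exact counting over the survivors) can be sketched in Python; the function and names here are illustrative only:

```python
def total_occurrences(messages: list[str], term: str) -> int:
    term = term.lower()
    # Phase 1: keep only messages that contain the term at all
    # (the role of the ILIKE filter that the trigram index accelerates).
    candidates = [m.lower() for m in messages if term in m.lower()]
    # Phase 2: count occurrences only within the surviving messages.
    return sum(len(m.split(term)) - 1 for m in candidates)

print(total_occurrences(["spam and SPAM", "eggs"], "spam"))  # 2
```

The savings come from phase 1: the expensive per-row counting runs only over rows the index already identified as matches.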
If this were my app, before putting it into production I would disallow search terms shorter than three or four characters, and I might even create a table of disallowed stop words. These queries slow down when they return large counts, so somebody could use a short search term like 'e' to deny service to your users.
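A minimal guard along those lines might look like this in application code (the stop-word list and minimum length here are assumptions, not part of the original setup):

```python
STOP_TERMS = {"the", "and", "lol"}  # hypothetical stop-word list

def is_searchable(term: str, min_length: int = 3) -> bool:
    # Reject terms that are too short or stop words, since they would
    # match a huge share of the ~15M rows and run for a long time.
    term = term.strip().lower()
    return len(term) >= min_length and term not in STOP_TERMS

print(is_searchable("e"))      # False
print(is_searchable("kappa"))  # True
```

Running the check in the bot, before the query is sent, means abusive terms never touch the database at all.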