Search code examples
postgresqlgroup-bysql-likestring-aggregation

LIKE search of joined and concatenated records is really slow (PostgreSQL)


I'm returning a unique list of id's from the users table, where specific columns in a related table (positions) contain a matching string.

The related table may have multiple records for each user record.

The query is taking a really really long time (its not scaleable), so I'm wondering if I'm structuring the query wrong in some fundamental way?

Users Table:

id | name
-----------
1  | frank
2  | kim
3  | jane

Positions Table:

id | user_id | title     | company | description
--------------------------------------------------
1  | 1       | manager   | apple   | 'Managed a team of...'
2  | 1       | assistant | apple   | 'Assisted the...'
3  | 2       | developer | huawei  | 'Build a feature that...'

For example: I want to return the user's id if a related positions record contains "apple" in either the title, company or description columns.

Query:

select
  distinct on (users.id) users.id,
  users.name,
  ...
from users
where (
    select
        string_agg(distinct users.description, ', ') ||
        string_agg(distinct users.title, ', ') ||
        string_agg(distinct users.company, ', ')
    from positions
    where positions.users_id::int = users.id
    group by positions.users_id::int) like '%apple%'

UPDATE

I like the idea of moving this into a join clause. But what I'm looking to do is filter users conditional on below. And I'm not sure how to do both in a join.

1) finding the keyword in title, company, description

or

2) finding the keyword with full-text search in an associated string version of a document in another table.

select
    to_tsvector(string_agg(distinct documents.content, ', '))
from documents
where users.id = documents.user_id
group by documents.user_id) @@ to_tsquery('apple')

So I was originally thinking it might look like,

select
  distinct on (users.id) users.id,
  users.name,
  ...
from users
where (
    (select
        string_agg(distinct users.description, ', ') ||
        string_agg(distinct users.title, ', ') ||
        string_agg(distinct users.company, ', ')
    from positions
    where positions.users_id::int = users.id
    group by positions.users_id::int) like '%apple%')
    or
    (select
        to_tsvector(string_agg(distinct documents.content, ', '))
    from documents
    where users.id = documents.user_id
    group by documents.user_id) @@ to_tsquery('apple'))

But then it was really slow - I can confirm the slowness is from the first condition, not the full-text search.


Solution

  • Might not be the best solution, but a quick option is:

    SELECT  DISTINCT ON ( u.id ) u.id,
            u.name
    FROM    users u
    JOIN    positions p ON (
                     p.user_id = u.id
                AND  ( description || title || company )
                LIKE '%apple%'
            );
    

    Basically got rid of the subquery, unnecessary string_agg usage, grouping on position table etc.

    What it does is doing conditional join and removing duplicate is covered by distinct on.

    PS! I used table aliases u and p to shorten the example

    EDIT: adding also WHERE example as requested

    SELECT  DISTINCT ON ( u.id ) u.id,
            u.name
    FROM    users u
    JOIN    positions p ON ( p.user_id = u.id )
    WHERE   ( p.description || p.title || p.company ) LIKE '%apple%'
    OR      ...your other conditions...;
    

    EDIT2: new details revealed setting new requirements of the original question. So adding new example for updated ask:

    Since you doing lookups to 2 different tables (positions and uploads) with OR condition then simple JOIN wouldn't work. But both lookups are verification type lookups - only looking does %apple% exists, then you do not need to aggregate and group by and convert the data. Using EXISTS that returns TRUE for first match found is what you seem to need anyway. So removing all unnecessary part and using with LIMIT 1 to return positive value if first match found and NULL if not (latter will make EXISTS to become FALSE) will give you same result.

    So here is how you could solve it:

    SELECT  DISTINCT ON ( u.id ) u.id,
            u.name
    FROM    users u
    WHERE   EXISTS (
                SELECT  1
                FROM    positions p
                WHERE   p.users_id = u.id::int
                AND     ( description || title || company ) LIKE '%apple%'
                LIMIT   1
            )
    OR      EXISTS (
                SELECT  1
                FROM    uploads up
                WHERE   up.user_id = u.id::int -- you had here reference to table 'document', but it doesn't exists in your example query, so I just added relation to 'upoads' table as you have in FROM, assuming 'content' column exists there
                AND     up.content LIKE '%apple%'
                LIMIT   1
            );
    

    NB! in your example queries have references to tables/aliases like documents which doesn't reflect anywhere in the FROM part. So either you have cut in your example real query with wrong naming or you have made other way typo is something you need to verify and adjust my example query accordingly.