Search code examples
sqlpostgresqlperformancesubqueryunion

Execute second SQL query if the first one returns less rows than the limit


I have two SQL queries like this for postgresql:

The first query is fetching subscriptions that were never synced or that were updated recently and synced a long time ago:

SELECT * 
FROM subscriptions
WHERE (last_sync_at IS NULL OR last_sync_at < updated_at) AND updated_at < $1
ORDER BY last_sync_at NULLS FIRST
LIMIT $2`,

The second query is fetching subscriptions, but the one synchronized the longest time ago first.

SELECT *
FROM subscriptions
ORDER BY last_sync_at NULLS FIRST
LIMIT $1`,

I want to aggregate these two queries like this: If the first query returns less results than the limit, then I want to execute the second query to fill the batch.

For example, if the limit is 100 and the first query returns 90, I want to execute the second one with a limit of 10.

I can do this easily with code, but I wonder I there's a way to do that with one SQL query and if it will be more perfomant.

I tried something like this, but obviously it doesn't work:

`(SELECT *, COUNT(*) as nb_rows_selected
    FROM subscriptions
    WHERE (last_chartmogul_sync_at IS NULL OR last_chartmogul_sync_at < updated_at) AND updated_at < $1
    ORDER BY last_chartmogul_sync_at NULLS FIRST
    GROUP BY id
    LIMIT $2)
UNION 
(SELECT * 
    FROM subscriptions
    ORDER BY last_sy NULLS FIRST
    LIMIT nb_rows_selected - $2)`,

Any help would be appreciated. Thanks!


Solution

  • I found the right answer by slightly modifying my first approach. I still use a UNION and a COUNT but this time I've broken the complexity by using the WITH clause:

    WITH unsynced_or_recently_updated_subscriptions AS (
        SELECT *
        FROM subscriptions
        WHERE (last_sync_at IS NULL OR last_sync_at < updated_at) 
        AND updated_at < $1
        ORDER BY last_sync_at NULLS FIRST
        LIMIT $2
    ),
    oldest_synced_subscriptions AS (
        SELECT *
        FROM subscriptions
        WHERE id NOT IN (SELECT id FROM unsynced_or_recently_updated_subscriptions)
        ORDER BY last_sync_at NULLS FIRST
        LIMIT $2 - (SELECT COUNT(*) FROM unsynced_or_recently_updated_subscriptions)
    )
    SELECT *
    FROM unsynced_or_recently_updated_subscriptions
    UNION ALL
    SELECT *
    FROM oldest_synced_subscriptions;
    

    Explanation:

    With this approach the second query is limited to [the number of rows of the first query] - [the limit]. Which is way effective. Because in case the first query fullfiled the limit, the second query will have a limit of 0 and will return immediately.

    Performance Considerations:

    From my understanding, this combined approach is generally more efficient than executing two separate queries, as it reduces the overhead associated with multiple database connections.

    The only scenario where two separate queries might be more performant is when the first query consistently fulfills the limit, in which case the second query is unnecessary. However, this situation is unlikely to occur frequently in my case.

    Also the NOT IN clause should have minimal impact on performance since the id column is indexed.

    EDIT:

    An even better solution could be to use NOT EXISTS instead of NOT IN for the second subquery:

    oldest_synced_subscriptions AS (
        SELECT *
        FROM subscriptions
        WHERE NOT EXISTS (
            SELECT id FROM unsynced_or_recently_updated_subscriptions 
            WHERE unsynced_or_recently_updated_subscriptions.id = subscriptions.id
        )
        ORDER BY last_chartmogul_sync_at NULLS FIRST
        LIMIT $2 - (SELECT COUNT(*) FROM unsynced_or_recently_updated_subscriptions)
    )
    

    The NOT EXISTS clause is generally more efficient than NOT IN, especially when dealing with NULL values. In terms of performance, NOT EXISTS is often faster than NOT IN because it can stop evaluating as soon as it finds a match, whereas NOT IN might have to scan the entire result set of the subquery.