Search code examples
postgresqlsubquerycommon-table-expression

Fetch first X rows from a sorted result in subquery


I want to select only those identifiers from a table which are contained in another subquery result and only top-2 rows must be used.

For example you can run the query on PG:

select a.rid 
from(
    values (41), (42), (43), (44)
    ) a(rid)
where a.rid in
(
select b.x
from(
    values (43), (44), (42), (41)
    ) b(x)
    order by b.x
fetch first 2 rows only
);

It should return

41
42

It works. However if I use the following query with CTE's

select rid 
from(
    values (41), (42), (43), (44)
    ) s(rid)
where rid in 
(
with recursive messages as (
    select
        rid as "root_id",
        parent.id,
        parent.sent_date
    from
        purchase_message parent
    where
        parent.id = rid
union
    select
        rid as "root_id",
        child.id,
        child.sent_date
    from
        purchase_message child
    join messages on
        child.parent_message_id = messages.id 
    ),
ordered_root_ids as (
        select
        m."root_id",
        max(m."sent_date") over (partition by m."root_id") as "maxSentDate"
    from
        messages m
            order by "maxSentDate" desc
    )
    select "root_id" from ordered_root_ids
    fetch first 2 rows only
)

it doesn't work. I get all the results

41, 42, 43, 44

Why doesn't it return only 2 results? According to my expectations it must be only 2 results, and it must be 43 and 44 because they are the latest messages.

I don't know how to provide all the related tables because there are too many. I hope they aren't necessary. Does CTE or recursive CTE or nested query affect this? Or combination of all of them? How to modify the above query so that it sorts data in the subquery, takes 2 top rows and then they are used in the in clause of the outer query?


Solution

  • Does CTE or recursive CTE or nested query affect this? Or combination of all of them?

    It doesn't. There's no way for limit 2 and fetch first 2 rows only to return more than just the two rows. Still, you're getting 2 rows for each value that you're checking, so it's more than enough. This:

    select rid 
    from(
        values (41), (42), (43), (44)
        ) s(rid)
    where rid in 
    (--parenthesized subexpression
    

    Will run the parenthesized subexpression for each rid independently, like a lateral subquery, because you referenced the outer rid inside there. As a result, for each value you check, you're getting up to two matches. I'm guessing you expected to only get the 2 matches once, and that the in() would compare each value to those two.

    You can move your inputs to a CTE, then reference all its rids at once from your subquery, have them all descend down the hierarchy, get sorted together then filtered down together to the top 2. Demo at db<>fiddle:

    with inputs(rid) as (values (41), (42), (43), (44) )
    select rid 
    from inputs join
    (   with recursive messages as (
            select parent.id as "root_id",
                   parent.id,
                   parent.sent_date
            from purchase_message parent
            where parent.id in (select rid from inputs)
            union
            select parent."root_id",
                   child.id,
                   child.sent_date
            from purchase_message child
            join messages parent
              on child.parent_message_id = parent.id 
            )
        select m."root_id" as rid
        from messages m
        group by m."root_id"
        order by max(m."sent_date") desc
        fetch first 2 rows only
    )q using(rid);
    
    rid
    43
    44

    I also had to change your logic that selects the top two root_id based on the most recent sent_date in their hierarchy. Once it's allowed to operate on all input rid's, your code would keep picking 2 of the same rid:

    ordered_root_ids as (
        select
            m."root_id",
            max(m."sent_date") over (partition by m."root_id") as "maxSentDate"
        from
            messages m
        order by "maxSentDate" desc
    )
    select "root_id" from ordered_root_ids
    fetch first 2 rows only
    

    because of your use of the window function, you'd get all rows of each hierarchy, each repeating their max(sent_date). As long as the hierarchy with the most recent sent_date has 2 or more rows in it, you will get those two, because they both have the same max(sent_date) and the same rid.