sql postgresql left-join greatest-n-per-group postgresql-14

How to include columns from two tables based on a common column value but include rows only from the left table?

I have two tables tb1 and tb2. tb2 is connected to tb1 via a foreign key named tk_id. Here is how my two tables look like

tb1

     Column      |            Type             | Collation | Nullable |                     Default                      
-----------------+-----------------------------+-----------+----------+--------------------------------------------------
 id              | integer                     |           | not null | nextval('tb2_id_seq'::regclass)
 created_at      | timestamp without time zone |           | not null | 
 modified_at     | timestamp without time zone |           | not null | 
 status          | double precision            |           | not null | 
 tk_id           | uuid                        |           | not null |

tb2

     Column     |            Type             | Collation | Nullable | Default 
----------------+-----------------------------+-----------+----------+---------
 id             | uuid                        |           | not null | 
 created_at     | timestamp without time zone |           | not null | 
 modified_at    | timestamp without time zone |           | not null | 
 destination_id | uuid                        |           | not null | 
 source_id      | uuid                        |           | not null | 
 tk_id          | uuid                        |           | not null |

Now I need to get all rows from tb1 which has columns from both tb1 and tb2 when the tk_id value matches for both the rows.

This is what I tried:

select tb1.created_at, tb1.status, tb2.source_id, tb2.destination_id from tb1 
inner join tb2 on tb1.tk_id = tb2.tk_id where 
tb1.created_at > timezone('utc', now()) - interval '40 minutes';

But I am getting way too many rows. Usually in a 40min interval, there would be around 800 records but after the join, I am getting around 100,000+ records.

EDIT: After some reading and a few tries, I made some changes in my query and managed to get the rows down to the expected number of rows. This is my query now

SELECT count(*) FROM tb1 LEFT OUTER JOIN (SELECT DISTINCT tk_id FROM tb2) t2 
ON tb1.tk_id = t2.tk_id where tb1.created_at > timezone('utc', now()) - 
interval '40 minutes';

But now I can't get the columns of tb2 in my select query.

What am I doing wrong?

EDIT2: Sorry if I couldn't make this clearer earlier. The join condition should be based on the latest occurrence of tk_id of the right table. So for every row of left table (tb1), it should match against the latest occurrence of tk_id of right table (tb2) and fetch the right table's columns.

Solution

Since there are obviously many rows in tb2 for a single row in tb1, you need to define which row to pick. Or some aggregation?

This query returns all qualifying rows from tb1, and adds columns from the one matching row in tb2 that was created last (if any):

SELECT tb1.created_at, tb1.status, tb2.*
FROM   tb1
LEFT   JOIN LATERAL (
   SELECT tb2.source_id, tb2.destination_id
   FROM   tb2
   WHERE  tb2.tk_id = tb1.tk_id
   ORDER  BY created_at DESC, id DESC
   LIMIT  1
   ) tb2 ON true
WHERE  tb1.created_at > timezone('utc', now()) - interval '40 minutes';

Added id DESC as tiebreaker since created_at might not be unique.
Adapt to your undisclosed needs.

Support this query with an index on tb2(tk_id, created_at, id).
And another index on tb1(created_at), obviously. Or even a covering index on tb1(created_at) INCLUDE (tk_id, status). See: