Search code examples
sqljsonpostgresqlgreatest-n-per-groupdistinct-on

Return only one element per unique ID with only latest joined record


Suppose the following:

create schema bv;
create table bv.user(id bigint primary key);
create table bv.user_photo (
  id bigint primary key,
  url varchar(255) not null,
  user_id bigint references bv.user(id)
);

insert into bv.user values (100), (101);
insert into bv.user_photo values
  (1, 'https://1.com', 100),
  (3, 'https://3.com', 100),
  (4, 'https://4.com', 101),
  (2, 'https://2.com', 100);

I'd like to query for and build an object for every user, and include only the latest image in the result.

Here's what I have:

select
  json_build_object(
    'id', u.id,
    'latest_image', up.url
  ) user
from bv.user u
left join bv.user_photo up
  on u.id = up.user_id

However this returns:

[
  {"id" : 100, "url" : "https://2.com"},
  {"id" : 100, "url" : "https://3.com"},
  {"id" : 100, "url" : "https://1.com"},
  {"id" : 101, "url" : "https://4.com"}
]

However, the expected result is:

[
  {"id" : 100, "url" : "https://3.com"},
  {"id" : 101, "url" : "https://4.com"}
]

I've tried using distinct:

select distinct on(u.id)
  json_build_object(
    'id', u.id,
    'url', up.url
  ) user
from bv.user u
left join bv.user_photo up
  on u.id = up.user_id
order by u.id, up.id DESC

But my question is whether or not this is the correct approach? I feel like I shouldn't be using distinct in such a situation.


Solution

  • With few photos per user, and while you return all (or most) users, DISTINCT ON is the best approach.
    But it's typically faster to get distinct photos before you join:

    SELECT json_build_object('id', u.id, 'url', p.url) AS "user"
    FROM   users u
    LEFT   JOIN (
       SELECT DISTINCT ON (user_id)
              user_id, url
       FROM   user_photo
       ORDER  BY user_id, id DESC
       ) p ON p.user_id = u.id;
    

    For many photos per user, an emulated index-skip scan is (much) faster. See:

    Summary array

    To produce one summary array, you can skip json_build_object(). json_agg() can aggregate the row directly:

    SELECT json_agg(sub) AS "users"
    FROM  (
       SELECT u.id, p.url
       FROM   users u
       LEFT   JOIN (
          SELECT DISTINCT ON (user_id)
                 user_id, url
          FROM   user_photo
          ORDER  BY user_id, id DESC
          ) p ON p.user_id = u.id
       ) sub;
    

    Notably, all queries so far include the key "url" with a null value where no photo is found. You may want to strip the noise:

    SELECT json_strip_nulls(json_agg(sub)) AS "users"
    FROM  ...
    

    For a small selection of users

    This seems to be your use case.

    A LATERAL subquery is typically faster. Also deals with many photos per user efficiently!

    SELECT json_build_object('id', u.id, 'url', p.url) AS "user"
    FROM   users u
    LEFT   JOIN LATERAL (
       SELECT up.url
       FROM   user_photo up
       WHERE  up.user_id = u.id
       ORDER  BY up.id DESC
       LIMIT  1
       ) p ON true
    WHERE  u.id IN (100, 101);  -- small selection
    

    Summary array for a small selection, skipping null values

    SELECT json_strip_nulls(json_agg(sub)) AS "users"
    FROM  (
       SELECT u.id, p.url
       FROM   users u
       LEFT   JOIN LATERAL (
          SELECT up.url
          FROM   user_photo up
          WHERE  up.user_id = u.id
          ORDER  BY up.id DESC
          LIMIT  1
          ) p ON true
       WHERE  u.id IN (100, 101)  -- small selection
       ) sub;
    

    fiddle

    Index

    DISTINCT ON does not need an index. All other queries absolutely need an index on user_photo(user_id, id). Or even, ideally:

    CREATE INDEX ON user_photo (user_id, id DESC) INCLUDE (url);
    

    Aside: Don't use the reserved word "user" as identifier. It works while schema-qualified, but fails without.