Tags: sql, group-by, greatest-n-per-group

SQL group by selecting top rows with possible nulls


The example table:

id | name | create_time         | group_id
1  | a    | 2022-01-01 12:00:00 | group1
2  | b    | 2022-01-01 13:00:00 | group1
3  | c    | 2022-01-01 12:00:00 | NULL
4  | d    | 2022-01-01 13:00:00 | NULL
5  | e    | NULL                | group2

I need to get the top row (the one with the minimal create_time) per group_id, with these conditions:

  1. create_time can be NULL - it should be treated as the minimal value
  2. group_id can be NULL - all rows with a NULL group_id should be returned (if that's not possible, we can use coalesce(group_id, id) or something similar, assuming that ids are unique and never collide with group ids)
  3. it should be possible to apply pagination to the query (so a join can be a problem)
  4. the query should be as universal as possible (so no vendor-specific features). Again, if that's not possible, it should at least work in MySQL 5 & 8, PostgreSQL 9+ and H2

The expected output for the example:

id | name | create_time         | group_id
1  | a    | 2022-01-01 12:00:00 | group1
3  | c    | 2022-01-01 12:00:00 | NULL
4  | d    | 2022-01-01 13:00:00 | NULL
5  | e    | NULL                | group2

I've already read similar questions on SO, but 90% of the answers rely on vendor-specific keywords (numerous answers use PARTITION BY, like https://stackoverflow.com/a/6841644/5572007), and the others don't honor NULL values in the grouping columns, and probably not pagination either (like https://stackoverflow.com/a/14346780/5572007).


Solution

  • You can combine two queries with UNION ALL. E.g.:

    select id, name, create_time, group_id
    from mytable
    where group_id is not null
    and not exists
    (
      select null
      from mytable older
      where older.group_id = mytable.group_id
      and
      (
        older.create_time < mytable.create_time
        -- treat a NULL create_time as the minimal value (condition 1)
        or (older.create_time is null and mytable.create_time is not null)
      )
    )
    union all
    select id, name, create_time, group_id
    from mytable
    where group_id is null
    order by id;
    

    This is standard SQL, and very basic at that. It should work in just about every RDBMS.
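As a quick sanity check of the approach, here is a minimal sketch using Python's sqlite3 module as a stand-in for a standards-leaning engine, with the table and data taken from the question. The extra NULL comparison in the subquery is my addition so that a NULL create_time counts as the minimal value (condition 1); with the sample data it returns the same rows either way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table mytable (
  id integer primary key,
  name text,
  create_time text,   -- ISO-8601 strings sort chronologically
  group_id text
);
insert into mytable values
  (1, 'a', '2022-01-01 12:00:00', 'group1'),
  (2, 'b', '2022-01-01 13:00:00', 'group1'),
  (3, 'c', '2022-01-01 12:00:00', null),
  (4, 'd', '2022-01-01 13:00:00', null),
  (5, 'e', null, 'group2');
""")

rows = conn.execute("""
select id, name, create_time, group_id
from mytable
where group_id is not null
and not exists
(
  select null
  from mytable older
  where older.group_id = mytable.group_id
  and (older.create_time < mytable.create_time
       -- treat a NULL create_time as the minimal value
       or (older.create_time is null and mytable.create_time is not null))
)
union all
select id, name, create_time, group_id
from mytable
where group_id is null
order by id
""").fetchall()

for row in rows:
    print(row)
# ids returned: 1, 3, 4, 5 - matching the expected output
```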

    As to pagination: this is usually costly, as you run the same query again and again in order to pick the "next" part of the result each time, instead of running the query only once. The best approach is usually to use the primary key to get to the next part, so an index on the key can be used. In the above query we'd ideally add where id > :last_biggest_id to both queries and limit the result, which would be fetch next <n> rows only in standard SQL. Every time we run the query, we use the last read ID as :last_biggest_id, so we read on from there.
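The keyset idea above can be sketched as a loop (again with sqlite3 for illustration, and a simplified table so the mechanics stand out). The names page_size and last_id are mine; sqlite uses LIMIT and ? placeholders where standard SQL would use fetch next <n> rows only and :last_biggest_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table mytable (id integer primary key, name text);
insert into mytable values (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e');
""")

page_size = 2   # hypothetical page size
last_id = 0     # last biggest id seen so far; 0 = before the first row
pages = []
while True:
    page = conn.execute(
        "select id, name from mytable where id > ? order by id limit ?",
        (last_id, page_size),
    ).fetchall()
    if not page:
        break
    pages.append([r[0] for r in page])
    last_id = page[-1][0]   # remember the last read id for the next page

print(pages)   # [[1, 2], [3, 4], [5]]
```

Because the id column is the primary key, each page is served by an index range scan starting right after the last row read.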

    Variables, however, are dealt with differently in the various DBMS; most commonly they are preceded by a colon, a dollar sign or an at sign. And the standard fetch clause, too, is supported by only some DBMS, while others have a LIMIT or TOP clause instead.

    If these little differences make it impossible to use one query everywhere, then you must find a workaround. For the variable this can be a one-row table holding the last read maximum ID. For the fetch clause this can mean you simply fetch as many rows as you need and stop there. Of course this isn't ideal, as the DBMS doesn't know that you only need the next n rows and cannot optimize the execution plan accordingly.
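The "fetch as many rows as you need and stop there" workaround is easy to sketch with a DB-API cursor: run the query without any limit clause and read only the first n rows (sqlite3 shown; the same fetchmany call exists in most Python DB-API drivers). As the answer notes, the DBMS still plans for the full result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table mytable (id integer primary key, name text);
insert into mytable values (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e');
""")

# No LIMIT/FETCH in the SQL itself - the client stops reading instead.
cur = conn.execute("select id, name from mytable order by id")
first_page = cur.fetchmany(3)   # read only the first 3 rows, then stop
cur.close()                     # discard the rest of the result

print([r[0] for r in first_page])   # [1, 2, 3]
```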

    And then there is the option of not doing the pagination in the DBMS at all, but reading the complete result into your app and handling pagination there (which then becomes a mere display thing, and of course allocates a lot of memory).