Search code examples
sqlpostgresqlsql-order-bydistinctgreatest-n-per-group

PostgreSQL remove duplicates by GROUP BY


I would like to print the last message of a person, but only his latest message should be printed per person. I use PostgreSQL 10.

+-----------+----------+--------------+
| name      |   body   |  created_at  |
+-----------+----------+--------------+
| Maria     | Test3    |  2017-07-07  |
| Paul      | Test5    |  2017-06-01  |
+-----------+----------+--------------+

I have tried this with the following SQL query, this gives me exactly that back but unfortunately the people are doubled in it.

SELECT * FROM messages 
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
+-----------+----------+--------------+
| name      |   body   |  created_at  |
+-----------+----------+--------------+
| Maria     | Test1    |  2016-06-01  |
| Maria     | Test2    |  2016-11-01  |
| Maria     | Test3    |  2017-07-07  |
| Paul      | Test4    |  2017-01-01  |
| Paul      | Test5    |  2017-06-01  |
+-----------+----------+--------------+

I tried to remove the duplicates with a DISTINCT, but unfortunately I get this error message:

SELECT DISTINCT ON (name) * FROM messages 
WHERE receive = 't'
GROUP BY name
ORDER BY MAX(created_at) DESC
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions LINE 1: SELECT DISTINCT ON (name) * FROM messages ^ : SELECT DISTINCT ON (name) * FROM messages WHERE receive = 't' GROUP BY name ORDER BY MAX(created_at) DESC

Do you have any ideas how I can solve this ?


Solution

  • You would use DISTINCT ON as follows:

    SELECT DISTINCT ON (name) * 
    FROM messages 
    WHERE receive = 't'
    ORDER BY name, created_at DESC
    

    That is:

    • no GROUP BY clause is needed

    • the column(s) listed in DISTINCT ON(...) must appear first in the ORDER BY clause

    • ... followed by the column that should be use to break the group (here, that is created_at)

    Note that the results of a distinct on query are always sorted by columns in the clause (because this sort is what is used to identifiy which rows should be kept).

    If you want more control over the sort order, then you can use window functions instead:

    SELECT *
    FROM (
        SELECT m.*, ROW_NUMBER() OVER(PARTITION BY name ORDER BY created_at DESC) rn
        FROM messages m
        WHERE receive = 't'
    ) t
    WHERE rn = 1
    ORDER BY created_at DESC