Search code examples
sqlpostgresqldatedatetimegreatest-n-per-group

Select data within one month prior to each user's last record


Assume I have a table called "Diary" like this:

| id | user_id |        recorded_at       | record |
|----|---------|--------------------------|--------|
| 20 |  50245  |2017-10-01 23:00:14.765366|   89   |
| 21 |  50245  |2017-12-05 10:00:33.135331|   97   |
| 22 |  50245  |2017-12-31 11:50:23.965134|   80   |
| 23 |  76766  |2015-10-06 11:00:14.902452|   70   |
| 24 |  76766  |2015-10-07 22:40:59.124553|   81   |

For each user I want to retrieve the latest row and all rows within one month prior to that.

In other words, for user_id 50245, I want the his/her data from "2017-12-01 11:50:23.965134" to "2017-12-31 11:50:23.965134"; for user_id 76766, I want his/her data from "2015-09-07 22:40:59.124553" to "2015-10-07 22:40:59.124553".

Hence the desired result looks like this:

| id | user_id |        recorded_at       | record |
|----|---------|--------------------------|--------|
| 21 |  50245  |2017-12-05 10:00:33.135331|   97   |
| 22 |  50245  |2017-12-31 11:50:23.965134|   80   |
| 23 |  76766  |2015-10-06 11:00:14.902452|   70   |
| 24 |  76766  |2015-10-07 22:40:59.124553|   81   |

Please note that the record of id 20 is not included because it is more than one month prior to user_id 50245's last record.

Is there any way I can write an SQL query to achieve this?


Solution

  • For small tables, any (valid) query technique is good.

    For big tables, details matter. Assuming:

    • There is also a users table with user_id as PK containing all relevant users (or possibly a few more). This is the typical setup.

    • You have (or can create) an index on diary (user_id, recorded_at DESC NULLS LAST). NULLS LAST is optional if recorded_at is defined NOT NULL. But make sure the query matches the index.

    • More than a few rows per user - the typical use case.

    This should be among the fastest options:

    SELECT d.*
    FROM   users u
    CROSS  JOIN LATERAL (
       SELECT recorded_at
       FROM   diary
       WHERE  user_id = u.user_id
       ORDER  BY recorded_at DESC NULLS LAST
       LIMIT 1
       ) d1
    JOIN   diary d ON d.user_id = u.user_id
                  AND d.recorded_at >= d1.recorded_at - interval '1 month'
    ORDER  BY d.user_id, d.recorded_at;
    

    Produces your desired result exactly.

    For only few rows per user, max() or DISTINCT ON () in a subquery are typically faster.

    Related (with detailed explanation):

    About the FROM clause: