sql postgresql greatest-n-per-group distinct-on

Select first event after a timestamp per row in another table in PostgreSQL

I have a table with visits to some city by some person on some timestamp:

city_visits:

person_id         city                timestamp
-----------------------------------------------
        1        Paris      2017-01-01 00:00:00
        1    Amsterdam      2017-01-03 00:00:00
        1     Brussels      2017-01-04 00:00:00
        1       London      2017-01-06 00:00:00
        2       Berlin      2017-01-01 00:00:00
        2     Brussels      2017-01-02 00:00:00
        2       Berlin      2017-01-06 00:00:00
        2      Hamburg      2017-01-07 00:00:00

Another table lists when a person bought ice cream:

ice_cream_events:

person_id      flavour                timestamp
-----------------------------------------------
        1      Vanilla      2017-01-02 00:12:00
        1    Chocolate      2017-01-05 00:18:00
        2   Strawberry      2017-01-03 00:09:00
        2      Caramel      2017-01-05 00:15:00

For each line in city_visits table, I need to join the same person's next ice-cream event, along with its timestamp and flavour:

desired_output:

person_id       city            timestamp  ic_flavour          ic_timestamp
---------------------------------------------------------------------------
        1      Paris  2017-01-01 00:00:00     Vanilla   2017-01-02 00:12:00
        1  Amsterdam  2017-01-03 00:00:00   Chocolate   2017-01-05 00:18:00
        1   Brussels  2017-01-04 00:00:00   Chocolate   2017-01-05 00:18:00
        1     London  2017-01-06 00:00:00        null                  null
        2     Berlin  2017-01-01 00:00:00  Strawberry   2017-01-03 00:09:00
        2   Brussels  2017-01-02 00:00:00  Strawberry   2017-01-03 00:09:00
        2     Berlin  2017-01-06 00:00:00        null                  null
        2    Hamburg  2017-01-07 00:00:00        null                  null

I've tried the following:

SELECT DISTINCT ON (cv.person_id, cv.timestamp)
  cv.person_id,
  cv.city,
  cv.timestamp,
  ic.flavour as ic_flavour,
  ic.timestamp as ic_timestamp
FROM city_visits cv
JOIN ice_cream_events ic
    ON ic.person_id = cv.person_id
   AND ic.timestamp > cv.timestamp

The DISTINCT ON clause prevents all but one future ice cream event from being joined for each city visit. It works, but it does not automatically select the first one; rather, it seems to pick an arbitrary future ice cream event for the same person. No ORDER BY clause I have added seems to change this.

Ideally, the DISTINCT ON clause would choose the minimal ic_timestamp each time it has to filter out duplicates.

Solution

Since there is no city in ice_cream_events, your query would join to lots of ice-cream events for every visit before picking the earliest one. I suggest LEFT JOIN LATERAL instead, which will be much faster for this case when backed by an appropriate index:

SELECT *
FROM   city_visits cv
LEFT   JOIN LATERAL (
   SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
   FROM   ice_cream_events 
   WHERE  person_id = cv.person_id
   AND    timestamp > cv.timestamp
   ORDER  BY timestamp
   LIMIT  1
   ) ice ON true
ORDER  BY cv.person_id, cv.timestamp;

LEFT [OUTER] JOIN includes visits without any ice-cream. If you only want visits with ice-cream, switch to CROSS JOIN LATERAL and drop the ON true condition.

Related: JOIN (select ...) ue ON 1=1?
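
For the inner-join variant mentioned above, a minimal sketch might look like this. It reuses the table and column names from the question; a cross join takes no join condition, so the trailing ON true goes away. Visits with no later ice-cream event for the same person are simply dropped from the result:

```sql
-- Sketch: keep only visits that are followed by an ice-cream event.
SELECT *
FROM   city_visits cv
CROSS  JOIN LATERAL (
   SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
   FROM   ice_cream_events
   WHERE  person_id = cv.person_id
   AND    timestamp > cv.timestamp
   ORDER  BY timestamp   -- earliest later event first
   LIMIT  1
   ) ice                 -- no ON clause with CROSS JOIN
ORDER  BY cv.person_id, cv.timestamp;
```

With the sample data from the question, this should leave out the London visit of person 1 and the last two visits of person 2.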

The outer ORDER BY only sorts result rows in this case (unlike when combined with DISTINCT ON, where it also decides which row to pick from each set of peers).

Related: Select first row in each GROUP BY group?
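
For comparison, here is a sketch of how the DISTINCT ON attempt from the question could be repaired, assuming (person_id, timestamp) identifies a visit uniquely in city_visits. The ORDER BY must start with the DISTINCT ON expressions; the extra ic.timestamp item then decides which joined row is kept per visit. LEFT JOIN replaces the plain JOIN so visits without a later ice-cream event survive:

```sql
-- Sketch: DISTINCT ON variant; typically slower than LATERAL on big tables,
-- because every later ice-cream event is joined before one is picked.
SELECT DISTINCT ON (cv.person_id, cv.timestamp)
       cv.person_id,
       cv.city,
       cv.timestamp,
       ic.flavour   AS ic_flavour,
       ic.timestamp AS ic_timestamp
FROM   city_visits cv
LEFT   JOIN ice_cream_events ic
       ON  ic.person_id = cv.person_id
       AND ic.timestamp > cv.timestamp
ORDER  BY cv.person_id, cv.timestamp, ic.timestamp;  -- earliest event wins
```

If two visits of the same person could share a timestamp, DISTINCT ON would collapse them into one row, so the DISTINCT ON list would need an additional distinguishing column.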

If the tables are big, be sure to have appropriate indexes to make this fast. Ideally, create a composite index on ice_cream_events (person_id, timestamp, flavour), with columns in this order, so the lateral subquery can be answered with an index-only scan. Add an index on city_visits (person_id, timestamp) for the outer sort, or maybe even on city_visits (person_id, timestamp, city) to allow another index-only scan. Which one pays off depends on your actual situation; the example is obviously symbolic.

Related: Optimize GROUP BY query to retrieve latest record per user
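
The indexes discussed above could be created roughly like this. The index names are made up for illustration; only the table names, column names, and column order come from the answer:

```sql
-- Supports the lateral subquery: equality on person_id, range and sort on
-- timestamp, with flavour appended so the index alone can satisfy the lookup.
CREATE INDEX ice_cream_events_person_ts_flavour_idx   -- hypothetical name
    ON ice_cream_events (person_id, timestamp, flavour);

-- Supports the outer ORDER BY cv.person_id, cv.timestamp.
CREATE INDEX city_visits_person_ts_idx                -- hypothetical name
    ON city_visits (person_id, timestamp);

-- Alternative to the previous index: also covers city.
-- CREATE INDEX city_visits_person_ts_city_idx
--     ON city_visits (person_id, timestamp, city);
```

Whether the planner actually uses them, and whether index-only scans kick in, is best checked with EXPLAIN (ANALYZE) on the real data, since that depends on table size and on how recently the tables were vacuumed.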