I have a table with visits to some city by some person on some timestamp:
city_visits:
person_id  city       timestamp
-----------------------------------------
1          Paris      2017-01-01 00:00:00
1          Amsterdam  2017-01-03 00:00:00
1          Brussels   2017-01-04 00:00:00
1          London     2017-01-06 00:00:00
2          Berlin     2017-01-01 00:00:00
2          Brussels   2017-01-02 00:00:00
2          Berlin     2017-01-06 00:00:00
2          Hamburg    2017-01-07 00:00:00
Another table lists when a person bought ice cream:
ice_cream_events:
person_id  flavour     timestamp
------------------------------------------
1          Vanilla     2017-01-02 00:12:00
1          Chocolate   2017-01-05 00:18:00
2          Strawberry  2017-01-03 00:09:00
2          Caramel     2017-01-05 00:15:00
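For anyone who wants to reproduce this, here is a minimal setup sketch. The column types (integer, text, timestamp) are an assumption, since only sample values are shown above; the rows are copied from the two tables.

```sql
-- Assumed column types; the question only shows sample values.
CREATE TABLE city_visits (
    person_id integer   NOT NULL,
    city      text      NOT NULL,
    timestamp timestamp NOT NULL
);

CREATE TABLE ice_cream_events (
    person_id integer   NOT NULL,
    flavour   text      NOT NULL,
    timestamp timestamp NOT NULL
);

INSERT INTO city_visits (person_id, city, timestamp) VALUES
    (1, 'Paris',     '2017-01-01 00:00:00'),
    (1, 'Amsterdam', '2017-01-03 00:00:00'),
    (1, 'Brussels',  '2017-01-04 00:00:00'),
    (1, 'London',    '2017-01-06 00:00:00'),
    (2, 'Berlin',    '2017-01-01 00:00:00'),
    (2, 'Brussels',  '2017-01-02 00:00:00'),
    (2, 'Berlin',    '2017-01-06 00:00:00'),
    (2, 'Hamburg',   '2017-01-07 00:00:00');

INSERT INTO ice_cream_events (person_id, flavour, timestamp) VALUES
    (1, 'Vanilla',    '2017-01-02 00:12:00'),
    (1, 'Chocolate',  '2017-01-05 00:18:00'),
    (2, 'Strawberry', '2017-01-03 00:09:00'),
    (2, 'Caramel',    '2017-01-05 00:15:00');
```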
For each row in the city_visits table, I need to join the same person's next ice-cream event, along with its timestamp and flavour:
desired_output:
person_id  city       timestamp            ic_flavour  ic_timestamp
--------------------------------------------------------------------------
1          Paris      2017-01-01 00:00:00  Vanilla     2017-01-02 00:12:00
1          Amsterdam  2017-01-03 00:00:00  Chocolate   2017-01-05 00:18:00
1          Brussels   2017-01-04 00:00:00  Chocolate   2017-01-05 00:18:00
1          London     2017-01-06 00:00:00  null        null
2          Berlin     2017-01-01 00:00:00  Strawberry  2017-01-03 00:09:00
2          Brussels   2017-01-02 00:00:00  Strawberry  2017-01-03 00:09:00
2          Berlin     2017-01-06 00:00:00  null        null
2          Hamburg    2017-01-07 00:00:00  null        null
I've tried the following:
SELECT DISTINCT ON (cv.person_id, cv.timestamp)
cv.person_id,
cv.city,
cv.timestamp,
ic.flavour as ic_flavour,
ic.timestamp as ic_timestamp
FROM city_visits cv
JOIN ice_cream_events ic
ON ic.person_id = cv.person_id
AND ic.timestamp > cv.timestamp
The DISTINCT ON clause ensures that only one future ice-cream event is joined to each city visit. That part works, but it does not select the earliest one; it seems to pick an arbitrary future ice-cream event for the same person. No ORDER BY clause I have tried seems to change this.
Ideally, the DISTINCT ON clause would choose the row with the minimal ic_timestamp each time it has to filter out duplicates.
Since there is no city in ice_cream_events, your query has to join every visit to all of the same person's later ice-cream events before picking the earliest one. I suggest LEFT JOIN LATERAL instead, which will be much faster in this case when backed by an appropriate index:
SELECT *
FROM city_visits cv
LEFT JOIN LATERAL (
SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
FROM ice_cream_events
WHERE person_id = cv.person_id
AND timestamp > cv.timestamp
ORDER BY timestamp
LIMIT 1
) ice ON true
ORDER BY cv.person_id, cv.timestamp;
LEFT [OUTER] JOIN keeps visits that have no later ice-cream event, with null for ic_flavour and ic_timestamp. If you only want visits followed by ice-cream, switch to CROSS JOIN LATERAL and drop the ON true condition.
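As a sketch of that variant, the same subquery can be attached with CROSS JOIN LATERAL, which takes no join condition. Visits for which the subquery returns no row are then dropped from the result:

```sql
-- Variant: keep only visits that are followed by at least one ice-cream event.
SELECT *
FROM city_visits cv
CROSS JOIN LATERAL (
    SELECT flavour AS ic_flavour, timestamp AS ic_timestamp
    FROM ice_cream_events
    WHERE person_id = cv.person_id
    AND timestamp > cv.timestamp
    ORDER BY timestamp   -- earliest later event first
    LIMIT 1
) ice
ORDER BY cv.person_id, cv.timestamp;
```

With the sample data from the question, this should leave out the three visits that show null in the desired output.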
The outer ORDER BY only sorts result rows in this case (unlike when combined with DISTINCT ON, where it also decides which row to pick from each set of peers).
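To illustrate the DISTINCT ON case, here is a sketch of how the query from the question could be made to pick the earliest event. The leading ORDER BY expressions must match the DISTINCT ON expressions; the extra ic.timestamp then decides which row survives within each visit. It uses LEFT JOIN so visits without a later event are kept, and it assumes, like the original query, that (person_id, timestamp) identifies a visit.

```sql
SELECT DISTINCT ON (cv.person_id, cv.timestamp)
       cv.person_id,
       cv.city,
       cv.timestamp,
       ic.flavour   AS ic_flavour,
       ic.timestamp AS ic_timestamp
FROM city_visits cv
LEFT JOIN ice_cream_events ic
       ON ic.person_id = cv.person_id
      AND ic.timestamp > cv.timestamp
-- First two expressions repeat DISTINCT ON; the third picks the earliest event.
ORDER BY cv.person_id, cv.timestamp, ic.timestamp;
```

This still joins every later event to each visit before discarding all but one, so it is mainly of interest for small tables or for understanding why the original attempt returned arbitrary events.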
If the tables are big, be sure to have appropriate indexes to make this fast. Ideally, a composite index on ice_cream_events (person_id, timestamp, flavour), with columns in this order, and one on city_visits (person_id, timestamp) for the outer sort. Or maybe even on city_visits (person_id, timestamp, city) to allow another index-only scan. It depends on your actual situation; the example is obviously symbolic.
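A minimal sketch of those indexes follows. The index names are made up for illustration, and whether the wider city_visits index pays off is something to check against your real data:

```sql
-- Supports the LATERAL subquery: equality on person_id, range and sort on timestamp,
-- with flavour included so the lookup can be answered from the index alone.
CREATE INDEX ice_cream_events_person_ts_flavour_idx
    ON ice_cream_events (person_id, timestamp, flavour);

-- Supports the outer ORDER BY; city is appended to allow an index-only scan.
CREATE INDEX city_visits_person_ts_city_idx
    ON city_visits (person_id, timestamp, city);
```

Running the query under EXPLAIN (ANALYZE) shows whether the planner actually uses them. Index-only scans also depend on the tables having been vacuumed recently enough for the visibility map to be up to date.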