I have the following table structure:
Item
ID | Name
--------
1 | Apple
2 | Pear
3 | Banana
4 | Plum
5 | Tomato
Event
ItemStart | ItemEnd | EventType | EventDate
--------------------------------------------
1 | 2 | Planted | 2014-01-01
1 | 3 | Picked | 2014-01-02
3 | 5 | Eaten | 2014-01-05
The two tables are linked only by the primary key of Item and the range of ItemStart and ItemEnd (inclusive) in Event. Events always refer to contiguous sequences of Items, but not all the Events for a given Item will have the same range. Events never occur on the same date for a given Item.
The query I'd like to produce is as follows:
List all the Items, and for each Item show the most recent Event
Sample output:
ID | Name | Event | Date
----------------------------
1 | Apple | Picked | 2014-01-02 (Planted then Picked)
2 | Pear | Picked | 2014-01-02 (Planted then Picked)
3 | Banana | Eaten | 2014-01-05 (Picked then Eaten)
4 | Plum | Eaten | 2014-01-05 (Eaten)
5 | Tomato | Eaten | 2014-01-05 (Eaten)
This seems reasonable enough on the face of it, and if there were traditional foreign-key relationships in place (imagine ItemID
instead of ItemStart
and ItemEnd
) I'd probably do a join to a correlated subquery something like this:
SELECT Name, EventType, EventDate
FROM Item i
INNER JOIN (
SELECT ItemID, EventType, EventDate
FROM Event e
WHERE EventDate = (SELECT MAX(EventDate) FROM Event e_max WHERE e_max.ItemID = e.ItemID)
) latest_events ON i.ID = latest_events.ItemID
However, with the range relationship in place I'm stuck, I want to do something more like this but it doesn't work:
SELECT Name, EventType, EventDate
FROM Item i
INNER JOIN (
SELECT ItemStart, ItemEnd, EventType, EventDate
FROM Event e
WHERE EventDate = (SELECT MAX(EventDate) FROM Event e_max WHERE i.ID >= e_max.ItemStart AND i.ID <= e_max.ItemEnd)
) latest_events ON i.ID >= latest_events.ItemStart AND i.ID <= latest_events.ItemEnd
I get an error about i.ID >= e_max.ItemStart AND i.ID <= e_max.ItemEnd
on line 6, because you can't reference i
from within another part of the join. I wanted to do that (which isn't necessary in the simpler example) because when I'm constructing the subquery I no longer have a single ID to link to - the overlapping ranges mean that there are many possible ways of including a single Item, and so I want to refer directly back to that item, whose ID is only available in the top-level Item table.
I hope that makes sense.
I'm using SQL Server 2008 R2. This is for a report which will run overnight, so speed is not as important is it could be, but there are very many Items (100s of million); while there are multiple Events against each Item, the use of large ranges means there are much much fewer Event records.
Things I've thought about:
How can I produce this query? Thanks in advance!
You can do this utilizing a CTE
and row_number()
.
;with cte as
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY i.id ORDER BY e.EventDate DESC) as rNum
FROM Item i
JOIN Event e
ON i.id between e.ItemStart and e.ItemEnd
)
SELECT ID,
Name,
EventType,
EventDate FROM cte
WHERE rNum = 1
Basically the CTE has joined item and event and added a new column for rownumber and is partitioned on item.ID. Here's a screenshot of what it looks like. From here I just select rNum = 1 which should be the max event date for each item.id.