Search code examples
sqlsql-servergtfssql-timestamp

SQL Server max timestamp per day for specific column from joined table


I'm really new to SQL and have essentially googled my way to this point, I'm pretty stuck now.. so I hope someone out there can help!

Goal: get the rows with the most recent timestamp per day for every trip_id in MS SQL Server (there are multiple tables that need to be joined to get the data needed).

So each trip_id should have 1 row per day, like so… [removed some columns for readability]

timestamp,trip_id,stop_id,stop_code,arrival_time,departure_delay
4/28/2017 18:29,8888922,2847,52818,11:02:34,0
4/27/2017 18:26,8888922,2847,52818,11:02:34,60
4/25/2017 18:27,8888922,2847,52818,11:02:34,-120
4/28/2017 18:56,8888922,2847,52818,11:32:34,-60
4/25/2017 18:59,8888922,2847,52818,11:32:34,120
4/28/2017 19:34,8888922,2847,52818,12:02:34,360
4/27/2017 19:31,8888922,2847,52818,12:02:34,540
4/25/2017 19:27,8888922,2847,52818,12:02:34,-120

However right now the best I've been able to do is get the maximum timestamp for each day and departure_delay with the following query

select
max(trip_updates.timestamp) as max, stop_times.trip_id, stops.stop_id, stops.stop_code, stop_times.arrival_time, trips.service_id,
stops.stop_name, stop_times.shape_dist_traveled, stop_time_updates.departure_delay
from stops
inner join stop_times on stops.stop_id = stop_times.stop_id
inner join trips on trips.trip_id = stop_times.trip_id
inner join routes on trips.route_id = routes.route_id
inner join trip_updates on stop_times.trip_id = trip_updates.trip_id
inner join stop_time_updates on trip_updates.oid = stop_time_updates.trip_update_id
where
stop_code = '52818'
and service_id = '1'
and stop_times.arrival_time between '11:00%' and '14:00%'
and route_short_name = '134'
group by stop_times.trip_id, stops.stop_id, stops.stop_code, stop_times.arrival_time, trips.service_id,
stops.stop_name, stop_times.shape_dist_traveled, stop_time_updates.departure_delay
order by stop_times.arrival_time asc, max(trip_updates.timestamp) desc

Which is giving me results like...

timestamp,trip_id,stop_id,stop_code,arrival_time,departure_delay
4/28/2017 18:29,8888922,2847,52818,11:02:34,0
4/28/2017 18:21,8888922,2847,52818,11:02:34,30
4/28/2017 18:16,8888922,2847,52818,11:02:34,60
4/28/2017 18:11,8888922,2847,52818,11:02:34,120
4/27/2017 18:26,8888922,2847,52818,11:02:34,60
4/27/2017 18:22,8888922,2847,52818,11:02:34,30
4/27/2017 18:20,8888922,2847,52818,11:02:34,0

All help is appreciated! Thank you!


Solution

  • As mentioned in my comment, one way to achieve this if you need to select all your current columns would be using the ROW_NUMBER() window function and removing your GROUP BY. For example,

    SELECT [Max] = [timestamp], trip_id, stop_id, stop_code, arrival_time, service_id, stop_name, shape_dist_traveled, departure_delay
    FROM
    (
        SELECT trip_updates.[timestamp], stop_times.trip_id, stops.stop_id, stops.stop_code, stop_times.arrival_time, 
               trips.service_id, stops.stop_name, stop_times.shape_dist_traveled, stop_time_updates.departure_delay,
               RN = ROW_NUMBER() OVER (PARTITION BY stop_times.trip_id, CAST(trip_updates.[timestamp] AS DATE) ORDER BY trip_updates.[timestamp] DESC) 
               -- This assigns a row number for each row within each trip_id and each day, where a row number of 1 will be the highest timestamp.
        FROM ... 
    
        <put all your current joins / where clauses here>
    
        AND route_short_name = '134'
        -- Note: no GROUP BY
    ) AS T
    WHERE RN = 1 -- This ensures you select only the the first row for each trip_id.
    ORDER BY arrival_time, [timestamp];