Search code examples
c#asp.netperformancelinqgtfs

How can I make this LINQ query of an Enumerable DataTable of GTFS data faster?


I'm working with the GTFS data for the New York City MTA subway system. I need to find the stop times for each route at a specific stop. To do that, I get the stop times from a StopTimes DataTable that I have, for a specific stop_id. I only want stop times between now and the next 2 hours.

Then, I need to lookup the trip for each stop time, using the trip_id value. From that trip, I have to lookup the route, using the route_id value, in order to get the route name or number for the stop time.

Here are the counts for each DataTable: StopTimes(522712), Trips(19092), Routes(27).

Right now, this takes anywhere from 20 seconds to 40 seconds to execute. How can I speed this up? Any and all suggestions are appreciated. Thanks!

foreach (var r in StopTimes.OrderBy(z => z.Field<DateTime>("departure_time").TimeOfDay)
                           .Where(z => z.Field<string>("stop_id") == stopID &&
                                  z["departure_time"].ToString() != "" &&
                                  z.Field<DateTime>("departure_time").TimeOfDay >= DateTime.UtcNow.AddHours(-5).TimeOfDay &&
                                  z.Field<DateTime>("departure_time").TimeOfDay <= DateTime.UtcNow.AddHours(-5).AddHours(2).TimeOfDay))
        {
            var trip = (from z in Trips
                        where z.Field<string>("trip_id") == r.Field<string>("trip_id") &&
                              z["route_id"].ToString() != ""
                        select z).Single();

            var route = (from z in Routes
                         where z.Field<string>("route_id") == trip.Field<string>("route_id")
                         select z).Single();

            // do stuff (not time-consuming)
        }

Solution

  • Try this:

    var now = DateTime.UtcNow;
    var tod0 = now.AddHours(-5).TimeOfDay;
    var tod1 = now.AddHours(-5).AddHours(2).TimeOfDay;
    
    var sts =
        from st in StopTimes
        let StopID = st.Field<string>("stop_id")
        where StopID == stopID
        where st["departure_time"].ToString() != ""
        let DepartureTime = st.Field<DateTime>("departure_time").TimeOfDay
        where DepartureTime >= tod0
        where DepartureTime >= tod1
        let TripID = st.Field<string>("trip_id")
        select new
        {
            StopID,
            TripID,
            DepartureTime,
        };
    

    Note that there is no orderby in this query and that we're returning an anonymous type. For your "do stuff (not time-consuming)" code to run you may need to add some more properties.

    The same approach happens for Trips & Routes.

    var ts =
        from t in Trips
        where t["route_id"].ToString() != ""
        let TripID = t.Field<string>("trip_id")
        let RouteID = t.Field<string>("route_id")
        select new
        {
            TripID,
            RouteID,
        };
    
    var rs =
        from r in Routes
        let RouteID = r.Field<string>("route_id")
        select new
        {
            RouteID,
        };
    

    Since you're getting a single record for each look up then using ToDictionary(...) is a good choice to use.

    var tripLookup = ts.ToDictionary(t => t.TripID);
    var routeLookup = rs.ToDictionary(r => r.RouteID);
    

    Now your query looks like this:

    var query = from StopTime in sts.ToArray()
                let Trip = tripLookup[StopTime.TripID]
                let Route = routeLookup[Trip.RouteID]
                orderby StopTime.DepartureTime
                select new
                {
                    StopTime,
                    Trip,
                    Route,
                };
    

    Notice that I've used .ToArray() and I've put the orderby right at the end.

    And you run your code like this:

    foreach (var q in query)
    {
        // do stuff (not time-consuming)
    }
    

    Let me know if this helps.