I'm using a slightly modified GTFS database.
I have a first step algorithm that given two geographical locations provides:
The second step algorithm finds the best journeys matching those stops and routes.
This is working well on direct journeys as well as journeys using one connection.
My problem arises when trying to find the best journey using 2 connections (so there are 3 trips to be searched).
The GTFS format has the following tables (each table has a foreign key to the previous/next table in this list):
stops
: stop information (geolocation, name, etc)stop_times
: timetabletrips
: itinerary taken by a vehicle (bus, metro, etc)routes
: family of trips that roughly take the same path (e.g. standard and express trips on the same route, but different stops taken)I have added the following tables
stop_connections
: stop
to stop
connections (around 1 to 20)stops_routes
: lists the available routes
at every stop
Here's the table row count in a city where I get slow results (Paris, France):
stops
: 28kstop_times
: 12Mtrips
: 513kroutes
: 1kstop_connections
: 365kstops_routes
: 227kThe first step of my algo takes two latitude/longitude points as input, and provides:
The second step takes each start stop, and analyses the available journeys that use only the routes selected by the first step.
This is the part that I'm trying to optimize. Here's how I'm querying the database:
My search terms (green in the picture):
Here's what I do now:
stop_connections
This is working fine on some cases, but it can be very slow in others. Usually as soon as I join the timetable or the stop connections, there is a 10x increase of the returned rows. Since I'm joining these table 8 times, there are potentially 10^8 rows to be searched by the engine.
Now I'm sure that I can get this to be more efficient.
My problem is that the number of rows increases at every join, and the arrival stop selection is made at the very end.
I mean I get all the possible journeys from a given stop at a given departure time (there can be millions of combinations), and only when my search reaches the last trip, I can filter on the ~20 allowed arrival stops.
It could be much faster if I could somehow 'know' soon enough that a route isn't worth searching.
Here's what I tried/thought of:
stops_routes
when joining stop_connections
Only select stops at a connection that lead to the allowed routes at next trip.
This is sometimes efficient when there is a lot of connections and not all the connected stops are interesting (some connected stop might only be used by a route we don't want to take).
However this inner join can increase the number of rows if there are not many connected stops and a lot of allowed routes.
stop_times
tableCreate a smaller copy of the
stop_times
that contains only the timetable of the next two hours or so. Indeed, having the database engine search for the timetable (up to 10pm for example) when my trips starts at 8am is useless. Keeping only 8am-10am is enough and much faster.
This is very efficient, because it dramatically decreases the number of rows to be searched.
I have implemented this with success, it decreased the search time by a factor of about 10x or even 100x.
There is usually, in a metropolitan area, large routes that are very useful when travelling large distances. But these routes aren't the best option when travelling small distances. A human person who knows his own city's public transportation system will quickly tell that from this neighborhood to this other, the best option is to take a specific route.
However this is very difficult to do, and requires a customization on every city.
I plan to make this algo completely independant of the city, so I'm not really willing to go down that road
The first search is slow, but the information taken from it can be used to serve fast result to the next person with a similar journey.
However there are so many combinations of departure and arrival stops that the information taken from one query might not be very useful.
I don't know if this is a good idea. I haven't implemented it.
I'm running out of ideas. I know this is not a programming question, but rather a request of ideas on an algorithm. I hope this falls into the SO scope.
Thanks to zebediah49, I have implemented the following algorithm:
First, I have created an ID on the trips table, that uniquely identifies it. It is based on the list of stops taken in sequence. So this ID guarantees that two trips with the same ID will take exactly the same route.
I called this ID trip_type
.
I have improved my stop_connections
table so that it includes a cost. This is used to select the best connection when two 'from' stops are connected to the same 'to' stop.
Limit those trips to only 1 per trip type (group by trip_type
)
Select only the best trip if there are two trips reaching the same stop
Select only the best connection if there are >1 stops that are connected to the same stop
I have splitted this into several subqueries and temporary tables, because I can easily group and filter the best stops/trips at each step. This ensures that the minimum searches are sent to the SQL server.
I have stored this algorithm into an SQL procedure, that will do this in a single SQL statement:
call Get2CJourneys(dt, sd, sa, r1, r2, r3)
Where:
dt
: departure timesd
: stops at departure pointsa
: stops at arrival pointr1
, r2
, r3
: allowed routes for the 1st, 2nd and 3rd tripsThe procedure call returns interesting results in <600ms where my previous algorithm returned the same results in several minutes.