mysql database performance algorithm path-finding

How to optimize my pathfinding algorithm on a GTFS network

Intro

I'm using a slightly modified GTFS database.

I have a first step algorithm that given two geographical locations provides:

the list of stops around departure and arrival
the list of routes that connects those list of stops

The second step algorithm finds the best journeys matching those stops and routes.
This is working well on direct journeys as well as journeys using one connection.

My problem arises when trying to find the best journey using 2 connections (so there are 3 trips to be searched).

Database

The GTFS format has the following tables (each table has a foreign key to the previous/next table in this list):

stops: stop information (geolocation, name, etc)
stop_times: timetable
trips: itinerary taken by a vehicle (bus, metro, etc)
routes: family of trips that roughly take the same path (e.g. standard and express trips on the same route, but different stops taken)

I have added the following tables

stop_connections: stop to stop connections (around 1 to 20)
stops_routes: lists the available routes at every stop

Here's the table row count in a city where I get slow results (Paris, France):

stops: 28k
stop_times: 12M
trips: 513k
routes: 1k
stop_connections: 365k
stops_routes: 227k

Algorithm

The first step of my algo takes two latitude/longitude points as input, and provides:

the list of stops at each location
the routes that can be used to connect those stops (with up to two connections)

The second step takes each start stop, and analyses the available journeys that use only the routes selected by the first step.

This is the part that I'm trying to optimize. Here's how I'm querying the database:

Query schema written on paper

My search terms (green in the picture):

one departure stop
several arrival stops (1 to 20)
allowed routes at departure, at first connection and on last trip
service ID (not relevant here, can be ignored)

Here's what I do now:

Start from a stop => get timetable => get trips => get routes; filter on allowed routes.
Connect the arrival stops of the first trip to a list of possible stops using stop_connections
Repeat from step 1 two times so that I have 3 trips/2 connections

The problem

This is working fine on some cases, but it can be very slow in others. Usually as soon as I join the timetable or the stop connections, there is a 10x increase of the returned rows. Since I'm joining these table 8 times, there are potentially 10^8 rows to be searched by the engine.

Now I'm sure that I can get this to be more efficient.

My problem is that the number of rows increases at every join, and the arrival stop selection is made at the very end.
I mean I get all the possible journeys from a given stop at a given departure time (there can be millions of combinations), and only when my search reaches the last trip, I can filter on the ~20 allowed arrival stops.
It could be much faster if I could somehow 'know' soon enough that a route isn't worth searching.

Optimizations

Here's what I tried/thought of:

1. Inner join `stops_routes` when joining `stop_connections`

Only select stops at a connection that lead to the allowed routes at next trip.

This is sometimes efficient when there is a lot of connections and not all the connected stops are interesting (some connected stop might only be used by a route we don't want to take).
However this inner join can increase the number of rows if there are not many connected stops and a lot of allowed routes.

2. Partition the `stop_times` table

Create a smaller copy of the stop_times that contains only the timetable of the next two hours or so. Indeed, having the database engine search for the timetable (up to 10pm for example) when my trips starts at 8am is useless. Keeping only 8am-10am is enough and much faster.

This is very efficient, because it dramatically decreases the number of rows to be searched.

I have implemented this with success, it decreased the search time by a factor of about 10x or even 100x.

3. Identify 'good' and 'bad' routes

There is usually, in a metropolitan area, large routes that are very useful when travelling large distances. But these routes aren't the best option when travelling small distances. A human person who knows his own city's public transportation system will quickly tell that from this neighborhood to this other, the best option is to take a specific route.

However this is very difficult to do, and requires a customization on every city.

I plan to make this algo completely independant of the city, so I'm not really willing to go down that road

4. Use crowdsourcing to identify paths that work well

The first search is slow, but the information taken from it can be used to serve fast result to the next person with a similar journey.

However there are so many combinations of departure and arrival stops that the information taken from one query might not be very useful.

I don't know if this is a good idea. I haven't implemented it.

I'm running out of ideas. I know this is not a programming question, but rather a request of ideas on an algorithm. I hope this falls into the SO scope.

Solution

Thanks to zebediah49, I have implemented the following algorithm:

0. Lookup tables

First, I have created an ID on the trips table, that uniquely identifies it. It is based on the list of stops taken in sequence. So this ID guarantees that two trips with the same ID will take exactly the same route.
I called this ID trip_type.

I have improved my stop_connections table so that it includes a cost. This is used to select the best connection when two 'from' stops are connected to the same 'to' stop.

1. Get trips running from the departure stop(s)

Limit those trips to only 1 per trip type (group by trip_type)

2. Get arrival stops from these trips

Select only the best trip if there are two trips reaching the same stop

3. Get connected stops from these arrival stops

Select only the best connection if there are >1 stops that are connected to the same stop

4. Repeat from step 1

I have splitted this into several subqueries and temporary tables, because I can easily group and filter the best stops/trips at each step. This ensures that the minimum searches are sent to the SQL server.

I have stored this algorithm into an SQL procedure, that will do this in a single SQL statement:

call Get2CJourneys(dt, sd, sa, r1, r2, r3)

Where:

dt: departure time
sd: stops at departure point
sa: stops at arrival point
r1, r2, r3: allowed routes for the 1st, 2nd and 3rd trips

The procedure call returns interesting results in <600ms where my previous algorithm returned the same results in several minutes.