sql postgresql inner-join common-table-expression

PostgreSQL proper query structure

I have a Lenovo laptop with Core i5 4210U, 4GB RAM 1600MHz, 500GB HDD, Ubuntu 18.04 running.

On it, I am running a PostgreSQL 10 container in Docker.

I am an amateur in SQL and completely new to PostgreSQL.

My business logic, in brief, is:

There are users registered in the app
There are fixed named stoppages scattered around an area (city/state)
Users can create routes by using any number of stoppages in any order
Other users can search for a route by mentioning arbitrary start and end locations (e.g. Home, Work)
The direction of the route must be taken into account while appearing in a search result

Below lies my approach to this problem.

I have created the following tables

Users

create table users (
  id bigserial primary key,
  name text not null,
  email text not null,
  password text not null,
  created_at timestamp with time zone default current_timestamp,
  updated_at timestamp with time zone default current_timestamp);

Stoppages

create table stoppages (
  id bigserial primary key,
  name text not null,
  lat double precision not null,
  lon double precision not null,
  created_at timestamp with time zone default current_timestamp,
  updated_at timestamp with time zone default current_timestamp);

Routes

create table routes (
  id bigserial primary key,
  name text not null,
  user_id bigint references users(id),
  seats smallint not null,
  start_at timestamp with time zone not null,
  created_at timestamp with time zone default current_timestamp,
  updated_at timestamp with time zone default current_timestamp);

Route Stoppage Map

create table route_stoppage_map (
  route_id bigint references routes(id),
  stoppage_id bigint references stoppages(id),
  sl_no smallint not null);

FYI: Here, the sl_no field is the index of the stoppage in that route.

I have used create extension earthdistance cascade; to install cube and earthdistance extensions in this database.

I have also written a utility function in PL/pgSQL, which is below

create function stp_within(double precision, double precision, double precision)
  returns table (id bigint, name text, lat double precision, 
                 lon double precision, created_at timestamp with time zone,
                 updated_at timestamp with time zone)
    as $$
      begin
        return query select * from stoppages where
          earth_distance(ll_to_earth(stoppages.lat, stoppages.lon), ll_to_earth($1, $2)) <= $3;
      end;
    $$ language plpgsql;

This function returns the stoppages that are within a given radius (in meters) of a specific geo-location.

I want to fetch routes going from geo-location 22.449227, 88.302977 to 22.599199, 88.423370. The default radius I am using is 2000 metres.

The query I have managed to write is below

with start_location as (select * from stp_within(22.449227, 88.302977, 2000)),

  end_location as (select * from stp_within(22.599199, 88.423370, 2000)),

  starting_routes as (select route_id, sl_no from route_stoppage_map where stoppage_id in (select id from start_location)),

  ending_routes as (select route_id, sl_no from route_stoppage_map where stoppage_id in (select id from end_location)),

  matches as (select distinct starting_routes.route_id from starting_routes inner join ending_routes on 
    starting_routes.route_id = ending_routes.route_id and starting_routes.sl_no < ending_routes.sl_no),

  selected_routes as (select name, user_id from routes where id in (select route_id from matches))

  select selected_routes.name as route_name, users.name as user_name from users inner join selected_routes on users.id = selected_routes.user_id;

This fetches the bare minimum results, but it is not complete, and I can't seem to figure out a way to solve the following issues:

I need the nearest stoppage on both ends in the result (i.e. the nearest stoppage the user can board from and the nearest stoppage to his/her destination).
The query is really slow. With explain analyze I found 0.931 ms planning time and 20.728 ms execution time, when there are just 2 users, 5 routes and 7 stoppages and each route has only 3-5 stoppages.
Is it possible to write the same query in a more efficient manner?

Please forgive me if I have missed any information.

Please help me address the problems stated above.

EDIT: The output of explain (analyze, buffers)

Result {
  command: 'EXPLAIN',
  rowCount: null,
  oid: null,
  rows:
   [ { 'QUERY PLAN': 'Hash Join  (cost=410.84..415.36 rows=200 width=64) (actual time=2.494..2.499 rows=3 loops=1)' },
     { 'QUERY PLAN': '  Hash Cond: (selected_routes.user_id = users.id)' },
     { 'QUERY PLAN': '  Buffers: shared hit=487' },
     { 'QUERY PLAN': '  CTE start_location' },
     { 'QUERY PLAN': '    ->  Function Scan on stp_within  (cost=0.25..10.25 rows=1000 width=72) (actual time=1.812..1.813 rows=3 loops=1)' },
     { 'QUERY PLAN': '          Buffers: shared hit=482' },
     { 'QUERY PLAN': '  CTE end_location' },
     { 'QUERY PLAN': '    ->  Function Scan on stp_within stp_within_1  (cost=0.25..10.25 rows=1000 width=72) (actual time=0.567..0.568 rows=2 loops=1)' },
     { 'QUERY PLAN': '          Buffers: shared hit=1' },
     { 'QUERY PLAN': '  CTE starting_routes' },
     { 'QUERY PLAN': '    ->  Hash Join  (cost=27.00..69.19 rows=885 width=10) (actual time=1.835..1.842 rows=9 loops=1)' },
     { 'QUERY PLAN': '          Hash Cond: (route_stoppage_map.stoppage_id = start_location.id)' },
     { 'QUERY PLAN': '          Buffers: shared hit=483' },
     { 'QUERY PLAN': '          ->  Seq Scan on route_stoppage_map  (cost=0.00..27.70 rows=1770 width=18) (actual time=0.002..0.004 rows=23 loops=1)' },
     { 'QUERY PLAN': '                Buffers: shared hit=1' },
     { 'QUERY PLAN': '          ->  Hash  (cost=24.50..24.50 rows=200 width=8) (actual time=1.825..1.825 rows=3 loops=1)' },
     { 'QUERY PLAN': '                Buckets: 1024  Batches: 1  Memory Usage: 9kB' },
     { 'QUERY PLAN': '                Buffers: shared hit=482' },
     { 'QUERY PLAN': '                ->  HashAggregate  (cost=22.50..24.50 rows=200 width=8) (actual time=1.822..1.823 rows=3 loops=1)' },
     { 'QUERY PLAN': '                      Group Key: start_location.id' },
     { 'QUERY PLAN': '                      Buffers: shared hit=482' },
     { 'QUERY PLAN': '                      ->  CTE Scan on start_location  (cost=0.00..20.00 rows=1000 width=8) (actual time=1.813..1.816 rows=3 loops=1)' },
     { 'QUERY PLAN': '                            Buffers: shared hit=482' },
     { 'QUERY PLAN': '  CTE ending_routes' },
     { 'QUERY PLAN': '    ->  Hash Join  (cost=27.00..69.19 rows=885 width=10) (actual time=0.585..0.590 rows=7 loops=1)' },
     { 'QUERY PLAN': '          Hash Cond: (route_stoppage_map_1.stoppage_id = end_location.id)' },
     { 'QUERY PLAN': '          Buffers: shared hit=2' },
     { 'QUERY PLAN': '          ->  Seq Scan on route_stoppage_map route_stoppage_map_1  (cost=0.00..27.70 rows=1770 width=18) (actual time=0.003..0.005 rows=23 loops=1)' },
     { 'QUERY PLAN': '                Buffers: shared hit=1' },
     { 'QUERY PLAN': '          ->  Hash  (cost=24.50..24.50 rows=200 width=8) (actual time=0.577..0.577 rows=2 loops=1)' },
     { 'QUERY PLAN': '                Buckets: 1024  Batches: 1  Memory Usage: 9kB' },
     { 'QUERY PLAN': '                Buffers: shared hit=1' },
     { 'QUERY PLAN': '                ->  HashAggregate  (cost=22.50..24.50 rows=200 width=8) (actual time=0.574..0.575 rows=2 loops=1)' },
     { 'QUERY PLAN': '                      Group Key: end_location.id' },
     { 'QUERY PLAN': '                      Buffers: shared hit=1' },
     { 'QUERY PLAN': '                      ->  CTE Scan on end_location  (cost=0.00..20.00 rows=1000 width=8) (actual time=0.568..0.569 rows=2 loops=1)' },
     { 'QUERY PLAN': '                            Buffers: shared hit=1' },
     { 'QUERY PLAN': '  CTE matches' },
     { 'QUERY PLAN': '    ->  Unique  (cost=122.04..198.25 rows=200 width=8) (actual time=2.451..2.458 rows=3 loops=1)' },
     { 'QUERY PLAN': '          Buffers: shared hit=485' },
     { 'QUERY PLAN': '          ->  Merge Join  (cost=122.04..194.99 rows=1305 width=8) (actual time=2.450..2.456 rows=5 loops=1)' },
     { 'QUERY PLAN': '                Merge Cond: (starting_routes.route_id = ending_routes.route_id)' },
     { 'QUERY PLAN': '                Join Filter: (starting_routes.sl_no < ending_routes.sl_no)' },
     { 'QUERY PLAN': '                Rows Removed by Join Filter: 7' },
     { 'QUERY PLAN': '                Buffers: shared hit=485' },
     { 'QUERY PLAN': '                ->  Sort  (cost=61.02..63.23 rows=885 width=10) (actual time=1.852..1.852 rows=9 loops=1)' },
     { 'QUERY PLAN': '                      Sort Key: starting_routes.route_id' },
     { 'QUERY PLAN': '                      Sort Method: quicksort  Memory: 25kB' },
     { 'QUERY PLAN': '                      Buffers: shared hit=483' },
     { 'QUERY PLAN': '                      ->  CTE Scan on starting_routes  (cost=0.00..17.70 rows=885 width=10) (actual time=1.836..1.844 rows=9 loops=1)' },
     { 'QUERY PLAN': '                            Buffers: shared hit=483' },
     { 'QUERY PLAN': '                ->  Sort  (cost=61.02..63.23 rows=885 width=10) (actual time=0.596..0.597 rows=10 loops=1)' },
     { 'QUERY PLAN': '                      Sort Key: ending_routes.route_id' },
     { 'QUERY PLAN': '                      Sort Method: quicksort  Memory: 25kB' },
     { 'QUERY PLAN': '                      Buffers: shared hit=2' },
     { 'QUERY PLAN': '                      ->  CTE Scan on ending_routes  (cost=0.00..17.70 rows=885 width=10) (actual time=0.586..0.592 rows=7 loops=1)' },
     { 'QUERY PLAN': '                            Buffers: shared hit=2' },
     { 'QUERY PLAN': '  CTE selected_routes' },
     { 'QUERY PLAN': '    ->  Hash Join  (cost=9.00..31.32 rows=200 width=40) (actual time=2.483..2.485 rows=3 loops=1)' },
     { 'QUERY PLAN': '          Hash Cond: (routes.id = matches.route_id)' },
     { 'QUERY PLAN': '          Buffers: shared hit=486' },
     { 'QUERY PLAN': '          ->  Seq Scan on routes  (cost=0.00..18.00 rows=800 width=48) (actual time=0.004..0.004 rows=5 loops=1)' },
     { 'QUERY PLAN': '                Buffers: shared hit=1' },
     { 'QUERY PLAN': '          ->  Hash  (cost=6.50..6.50 rows=200 width=8) (actual time=2.466..2.466 rows=3 loops=1)' },
     { 'QUERY PLAN': '                Buckets: 1024  Batches: 1  Memory Usage: 9kB' },
     { 'QUERY PLAN': '                Buffers: shared hit=485' },
     { 'QUERY PLAN': '                ->  HashAggregate  (cost=4.50..6.50 rows=200 width=8) (actual time=2.464..2.465 rows=3 loops=1)' },
     { 'QUERY PLAN': '                      Group Key: matches.route_id' },
     { 'QUERY PLAN': '                      Buffers: shared hit=485' },
     { 'QUERY PLAN': '                      ->  CTE Scan on matches  (cost=0.00..4.00 rows=200 width=8) (actual time=2.453..2.461 rows=3 loops=1)' },
     { 'QUERY PLAN': '                            Buffers: shared hit=485' },
     { 'QUERY PLAN': '  ->  CTE Scan on selected_routes  (cost=0.00..4.00 rows=200 width=40) (actual time=2.484..2.488 rows=3 loops=1)' },
     { 'QUERY PLAN': '        Buffers: shared hit=486' },
     { 'QUERY PLAN': '  ->  Hash  (cost=15.50..15.50 rows=550 width=40) (actual time=0.005..0.005 rows=2 loops=1)' },
     { 'QUERY PLAN': '        Buckets: 1024  Batches: 1  Memory Usage: 9kB' },
     { 'QUERY PLAN': '        Buffers: shared hit=1' },
     { 'QUERY PLAN': '        ->  Seq Scan on users  (cost=0.00..15.50 rows=550 width=40) (actual time=0.004..0.004 rows=2 loops=1)' },
     { 'QUERY PLAN': '              Buffers: shared hit=1' },
     { 'QUERY PLAN': 'Planning time: 0.642 ms' },
     { 'QUERY PLAN': 'Execution time: 3.471 ms' } ],
  fields:
   [ Field {
       name: 'QUERY PLAN',
       tableID: 0,
       columnID: 0,
       dataTypeID: 25,
       dataTypeSize: -1,
       dataTypeModifier: -1,
       format: 'text' } ],
  _parsers: [ [Function: noParse] ],
  RowCtor: null,
  rowAsArray: false,
  _getTypeParser: [Function: bound ] }

Solution

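You should inline the common table expressions. In PostgreSQL 10 every CTE is materialized on its own and acts as an optimization fence, so the planner cannot push conditions into it or reorder joins across it. You could also inline the function: the planner cannot see inside a PL/pgSQL function, which is why the Function Scan nodes in your plan carry the default estimate of rows=1000.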
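
If you would rather keep stp_within as a function instead of pasting the distance test into every query, one option is to rewrite it in plain SQL. This is only a sketch, assuming the same signature and result columns as your original. A single-statement SQL function marked stable is a candidate for inlining when it is called in a FROM clause, but check with explain whether that actually happens in your setup.

```sql
-- Sketch: same name, arguments and result columns as the PL/pgSQL version,
-- but written in LANGUAGE sql and marked STABLE so the planner may inline it.
create or replace function stp_within(double precision, double precision, double precision)
  returns table (id bigint, name text, lat double precision,
                 lon double precision, created_at timestamp with time zone,
                 updated_at timestamp with time zone)
as $$
  -- Columns are listed explicitly and qualified with the alias s
  -- so they cannot be confused with the output column names.
  select s.id, s.name, s.lat, s.lon, s.created_at, s.updated_at
  from stoppages s
  where earth_distance(ll_to_earth(s.lat, s.lon), ll_to_earth($1, $2)) <= $3;
$$ language sql stable;
```

If the plan still shows a Function Scan on stp_within after this change, the function was not inlined and writing the condition directly in the query is the safer route.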

Once the data gets larger, the distinct will hurt you as well. I prefer not to use "SELECT *"; list the columns you actually need instead, as that makes it easier to write indexes later.
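
To make the remark about indexes concrete, here is a rough sketch. The index names are my own invention and nothing like them exists in your schema yet. The second index only helps if the radius test is written with earth_box, which the earthdistance documentation describes as the form suitable for an indexed search; the earth_distance check stays in place because the box is slightly larger than the circle.

```sql
-- Hypothetical index: find every (route, position) pair for a given stoppage
-- without scanning the whole mapping table.
create index route_stoppage_map_stoppage_idx
  on route_stoppage_map (stoppage_id, route_id, sl_no);

-- Hypothetical index: lets radius searches use GiST on the earth point.
create index stoppages_earth_idx
  on stoppages using gist (ll_to_earth(lat, lon));

-- Radius search written so that the GiST index above can be considered:
-- earth_box is the coarse, indexable filter, earth_distance the exact check.
select id, name
from stoppages
where earth_box(ll_to_earth(22.449227, 88.302977), 2000) @> ll_to_earth(lat, lon)
  and earth_distance(ll_to_earth(lat, lon), ll_to_earth(22.449227, 88.302977)) <= 2000;
```

With only 7 stoppages the planner will most likely keep using a sequential scan, so do not expect to see these indexes in the plan until the tables are much larger.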

This should be much closer to what you need:

select 
  selected_routes.name as route_name, 
  users.name as user_name 
from users 
join (
  select 
    name, 
    user_id 
  from routes 
  where id in (
    select starting_routes.route_id 
    from (
      select 
        route_id, 
        sl_no 
      from route_stoppage_map 
      where stoppage_id in (
        select id
        from stoppages 
        where earth_distance(
          ll_to_earth(stoppages.lat, stoppages.lon), 
          ll_to_earth(22.449227, 88.302977)) <= 2000
      )
    ) starting_routes 
    join (
      select route_id, sl_no 
      from route_stoppage_map 
      where stoppage_id in (
        select id 
        from stoppages 
        where earth_distance(
          ll_to_earth(stoppages.lat, stoppages.lon), 
          ll_to_earth(22.599199, 88.423370)) <= 2000
      )
    ) ending_routes 
    on starting_routes.route_id = ending_routes.route_id 
    and starting_routes.sl_no < ending_routes.sl_no
  )
) selected_routes on users.id = selected_routes.user_id
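
The query above does not yet return the nearest stoppage on each end. One way to get it, sketched here and not tuned, is a lateral subquery that picks a single boarding/alighting pair per route. I am assuming "nearest" means the pair with the smallest combined distance that still respects the direction of the route; the aliases board_at, alight_at and best are made up for the example.

```sql
select
  r.name as route_name,
  u.name as user_name,
  best.board_at,
  best.alight_at
from routes r
join users u on u.id = r.user_id
cross join lateral (
  -- For the current route, pair every stoppage near the start with every
  -- later stoppage near the destination, then keep the closest pair.
  select
    ss.name as board_at,
    es.name as alight_at
  from route_stoppage_map ms
  join stoppages ss on ss.id = ms.stoppage_id
  join route_stoppage_map me
    on me.route_id = ms.route_id
   and me.sl_no > ms.sl_no          -- direction: board before you alight
  join stoppages es on es.id = me.stoppage_id
  where ms.route_id = r.id
    and earth_distance(ll_to_earth(ss.lat, ss.lon),
                       ll_to_earth(22.449227, 88.302977)) <= 2000
    and earth_distance(ll_to_earth(es.lat, es.lon),
                       ll_to_earth(22.599199, 88.423370)) <= 2000
  order by
    earth_distance(ll_to_earth(ss.lat, ss.lon), ll_to_earth(22.449227, 88.302977))
    + earth_distance(ll_to_earth(es.lat, es.lon), ll_to_earth(22.599199, 88.423370))
  limit 1
) best;
```

Routes with no qualifying pair produce an empty lateral subquery and drop out of the result, so no distinct is needed. Note that this form visits every route, so on a large table you may prefer to start from the nearby stoppages instead.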

If you want to test the performance of the query, you will get much more accurate results with more data. If you grow the dataset to the point where the query takes 10-60 seconds, your attempts to tune it will be much more fruitful, because one-off costs (time spent retrieving or rendering results, opening and closing connections, etc.) become rounding errors.
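
A quick way to grow the dataset is generate_series. The sketch below is meant for a scratch copy of the database; the row counts, the coordinate box around your example points and the fixed five stoppages per route are all arbitrary assumptions, so adjust them until the timings are long enough to compare.

```sql
-- Synthetic users.
insert into users (name, email, password)
select 'user ' || g, 'user' || g || '@example.com', 'x'
from generate_series(1, 10000) g;

-- Synthetic stoppages scattered over a box that contains both example points.
insert into stoppages (name, lat, lon)
select 'stoppage ' || g,
       22.4 + random() * 0.3,
       88.2 + random() * 0.3
from generate_series(1, 50000) g;

-- Synthetic routes, spread round-robin over all existing users.
insert into routes (name, user_id, seats, start_at)
select 'route ' || g, u.id, 4, now() + g * interval '1 minute'
from generate_series(1, 100000) g
join (
  select id,
         row_number() over (order by id) as rn,
         count(*) over () as n
  from users
) u on u.rn = (g % u.n) + 1;

-- Five stoppages per route, chosen by simple arithmetic on the route id.
-- Run this on a scratch database: pre-existing routes also get five new rows.
with numbered as (
  select id,
         row_number() over (order by id) as rn,
         count(*) over () as n
  from stoppages
)
insert into route_stoppage_map (route_id, stoppage_id, sl_no)
select r.id, nb.id, k
from routes r
cross join generate_series(1, 5) k
join numbered nb on nb.rn = ((r.id * 7 + k * 13) % nb.n) + 1;

-- Refresh planner statistics before timing anything.
analyze;
```

The stoppages assigned to a route this way are not geographically sensible, but that does not matter for comparing query plans and timings.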