I have an SQL query that is performing poorly. I've done some research on joins, watched tutorials, ensured that I have the right indexes defined, etc., but honestly I'm a bit lost as to how to improve the performance of this frequently-called query.
I have the following schema definitions:
create_table "training_plans", :force => true do |t|
t.integer "user_id"
end
add_index "training_plans", ["user_id"], :name => "index_training_plans_on_user_id"
create_table "training_weeks", :force => true do |t|
t.integer "training_plan_id"
t.date "start_date"
end
add_index "training_weeks", ["training_plan_id", "start_date"], :name => "index_training_weeks_on_training_plan_id_and_start_date"
add_index "training_weeks", ["training_plan_id"], :name => "index_training_weeks_on_training_plan_id"
create_table "training_efforts", :force => true do |t|
t.string "name"
t.date "plandate"
t.integer "training_week_id"
end
add_index "training_efforts", ["plandate"], :name => "index_training_efforts_on_plandate"
add_index "training_efforts", ["training_week_id", "plandate"], :name => "index_training_efforts_on_training_week_id_and_plandate"
add_index "training_efforts", ["training_week_id"], :name => "index_training_efforts_on_training_week_id"
I then make the following call to gather all of the training_efforts associated with a specific training_plan, including all of the related ride objects, where the training_effort plandates fall within a target date range, ordering the result by plandate:
tefts = self.training_efforts.includes(:rides)
                             .order("plandate ASC")
                             .where("plandate >= ? AND plandate <= ?", beginning_date, end_date)
This produces the following query output:
TrainingEffort Load (3393.6ms) SELECT "training_efforts".* FROM "training_efforts"
INNER JOIN "training_weeks" ON "training_efforts"."training_week_id" = "training_weeks"."id"
WHERE "training_weeks"."training_plan_id" = 104
AND (plandate >= '2015-01-05' AND plandate <= '2016-01-03') ORDER BY plandate ASC
I believe that I have the right indexes defined, and the table is not that large, yet this is taking a huge amount of time. As further background, this is on Heroku Postgres. Lastly, I'll mention that on my development system the same query takes 3.3ms, which is slower than most of my queries but still nowhere near the roughly 1000x slowdown I'm seeing in production...
Thanks in advance for any help optimizing this query.
UPDATE Here is the EXPLAIN output for the query (issued on my dev system):
explain SELECT "training_efforts".* FROM "training_efforts" INNER JOIN "training_weeks"
ON "training_efforts"."training_week_id" = "training_weeks"."id"
WHERE "training_weeks"."training_plan_id" = 7
AND (plandate >= '2015-01-05' AND plandate <= '2016-01-03') ORDER BY plandate ASC;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Sort (cost=430.52..432.04 rows=606 width=120)
Sort Key: training_efforts.plandate
-> Hash Join (cost=15.12..402.51 rows=606 width=120)
Hash Cond: (training_efforts.training_week_id = training_weeks.id)
-> Seq Scan on training_efforts (cost=0.00..377.25 rows=1089 width=120)
Filter: ((plandate >= '2015-01-05'::date) AND (plandate <= '2016-01-03'::date))
-> Hash (cost=11.86..11.86 rows=261 width=4)
-> Seq Scan on training_weeks (cost=0.00..11.86 rows=261 width=4)
Filter: (training_plan_id = 7)
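The plan above shows only the planner's estimates. A quick way to compare those estimates against actual row counts and timings (a diagnostic suggestion, not something from the original post) is to run the same statement under EXPLAIN ANALYZE:

```sql
-- ANALYZE actually executes the query and reports real rows/timings
-- alongside the estimates; BUFFERS also shows cache vs. disk page reads.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "training_efforts".* FROM "training_efforts"
INNER JOIN "training_weeks" ON "training_efforts"."training_week_id" = "training_weeks"."id"
WHERE "training_weeks"."training_plan_id" = 7
  AND (plandate >= '2015-01-05' AND plandate <= '2016-01-03')
ORDER BY plandate ASC;
```

If the actual row counts diverge sharply from the estimated ones, the table statistics are likely stale.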
UPDATE 2 To see whether my indexes will be used, and noting that there are 7x as many training_efforts as training_weeks (both of which have date columns), I tried searching on the training_week dates instead of the training_effort dates, as follows:
explain SELECT "training_efforts".* FROM "training_efforts" INNER JOIN "training_weeks"
ON "training_weeks"."id" = "training_efforts"."training_week_id"
WHERE "training_weeks"."id" IN (SELECT "training_weeks"."id" FROM "training_weeks"
WHERE "training_weeks"."training_plan_id" = 7 AND (start_date >= '2015-01-05' AND start_date <= '2016-01-03'))
ORDER BY plandate ASC;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=376.83..378.34 rows=602 width=120)
Sort Key: training_efforts.plandate
-> Nested Loop (cost=14.23..349.04 rows=602 width=120)
-> Hash Semi Join (cost=13.95..26.83 rows=86 width=8)
Hash Cond: (training_weeks.id = training_weeks_1.id)
-> Seq Scan on training_weeks (cost=0.00..10.69 rows=469 width=4)
-> Hash (cost=12.87..12.87 rows=86 width=4)
-> Bitmap Heap Scan on training_weeks training_weeks_1 (cost=5.37..12.87 rows=86 width=4)
Recheck Cond: ((training_plan_id = 7) AND (start_date >= '2015-01-05'::date) AND (start_date <= '2016-01-03'::date))
-> Bitmap Index Scan on index_training_weeks_on_training_plan_id_and_start_date (cost=0.00..5.35 rows=86 width=0)
Index Cond: ((training_plan_id = 7) AND (start_date >= '2015-01-05'::date) AND (start_date <= '2016-01-03'::date))
-> Index Scan using index_training_efforts_on_training_week_id on training_efforts (cost=0.28..3.68 rows=7 width=120)
Index Cond: (training_week_id = training_weeks.id)
This appears to be slightly better, but I'm still not at all confident that it is optimized...
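As a further diagnostic (hypothetical, not something run in the original post), you can check whether the planner is even willing to use the indexes for the original query by disabling sequential scans for the current session:

```sql
-- Diagnostic only: strongly discourage seq scans for this session.
-- Never leave this set in production.
SET enable_seqscan = off;
EXPLAIN SELECT "training_efforts".* FROM "training_efforts"
INNER JOIN "training_weeks" ON "training_efforts"."training_week_id" = "training_weeks"."id"
WHERE "training_weeks"."training_plan_id" = 7
  AND (plandate >= '2015-01-05' AND plandate <= '2016-01-03')
ORDER BY plandate ASC;
RESET enable_seqscan;
```

If the index-based plan shown here has a much lower cost than the hash-join plan, stale statistics are a likely culprit; if its cost is higher, the planner may simply be choosing the seq scan correctly for tables this small.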
How many rows are in each table? Did you re-create these tables recently, or are they old? Have you analyzed the tables recently? It looks like the planner is doing sequential scans and not using any of your indexes.
I'd issue a
vacuum analyze
on your entire DB, or at least on these two tables. The optimizer will often skip indexes when it doesn't have up-to-date statistics for a table.
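For example (table names taken from the schema above):

```sql
-- Refresh planner statistics (and reclaim dead rows) for the two tables involved
VACUUM ANALYZE training_weeks;
VACUUM ANALYZE training_efforts;

-- Check when the tables were last analyzed; NULLs here suggest stale statistics
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('training_weeks', 'training_efforts');
```

On Heroku Postgres you can run these from a `heroku pg:psql` session.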