
Using a CTE in a WHERE clause is very slow


I have the following SQL query running on PostgreSQL v13.10:

WITH stuckable_statuses AS (
  SELECT status_id FROM status_descriptions
  WHERE (tags @> ARRAY['stuckable']::varchar[])
)

SELECT jobs.* FROM jobs
WHERE jobs.status = ANY(select status_id from stuckable_statuses)

It runs really slowly, while replacing ANY(select status_id from stuckable_statuses) with a hardcoded list of ids, e.g. (1, 2, 3), runs really fast.

Here is the EXPLAIN ANALYZE output for this query:

Gather  (cost=1005.64..5579003.45 rows=1563473 width=2518) (actual time=45.495..40138.515 rows=303 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Hash Semi Join  (cost=5.64..5421656.15 rows=651447 width=2518) (actual time=44.533..40126.793 rows=101 loops=3)
        Hash Cond: (jobs.status = status_descriptions.status_id)
        ->  Parallel Seq Scan on jobs  (cost=0.00..5378777.15 rows=13571815 width=2518) (actual time=0.892..38662.091 rows=10537079 loops=3)
        ->  Hash  (cost=5.56..5.56 rows=6 width=4) (actual time=0.377..0.378 rows=11 loops=3)
              Buckets: 1024  Batches: 1  Memory Usage: 9kB
              ->  Seq Scan on status_descriptions  (cost=0.00..5.56 rows=6 width=4) (actual time=0.310..0.370 rows=11 loops=3)
                    Filter: (tags @> '{stuckable}'::character varying[])
                    Rows Removed by Filter: 146
Planning Time: 0.711 ms
Execution Time: 40138.654 ms

Here are the table definitions (taken from Rails' schema.rb):


  create_table "jobs", id: :serial, force: :cascade do |t|
    t.string "filename"
    t.string "sandbox"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.integer "status", default: 0, null: false
    t.integer "provider_id"
    t.integer "lang_id"
    t.integer "profile_id"
    t.datetime "extra_date"
    t.datetime "main_date"
    t.datetime "performer_id"
    t.index ["provider_id", "status", "extra_date"], name: "jobs_on_media_provider_id__status__extra_date"
    t.index ["provider_id", "status", "main_date"], name: "jobs_on_media_provider_id_and_status_and_due_date"
    t.index ["profile_id", "status", "extra_date"], name: "index_jobs_on_profile_id__status__extra_date"
    t.index ["profile_id", "status", "main_date"], name: "index_transcription_jobs_on_profile_id_and_status_and_due_date"
    t.index ["status", "sandbox", "lang_id", "extra_date"], name: "index_jobs_on_status__sandbox__lang_id__extra_date"
    t.index ["status", "sandbox", "lang_id", "main_date"], name: "index_jobs_on_status_and_sandbox_and_lang_id_and_due_date"
    t.index ["performer_id", "status", "extra_date"], name: "index_jobs_on_performer_id__status__extra_date"
    t.index ["performer_id", "status", "main_date"], name: "index_jobs_on_performer_id_and_status_and_due_date"
  end

  create_table "status_descriptions", id: :serial, force: :cascade do |t|
    t.integer "status_id"
    t.string "title"
    t.string "tags", array: true
    t.index ["status_id"], name: "index_status_descriptions_on_status_id"
  end

Compared to the same SQL with a hardcoded array, I can see that it is not using the index on jobs.status, maybe because the jobs table is really huge (~15 million rows) while status_descriptions has only ~200 rows.

Could you please help me optimize this SQL, if possible?

Thank you!

UPDATE:

Here is the query with the hardcoded array:

SELECT jobs.* FROM jobs
WHERE jobs.status IN (2, 3, 4, 291, 290, 46, 142, 260, 6, 7, 270)

And here is its EXPLAIN ANALYZE output:

Index Scan using index_jobs_on_status__sandbox__lang_id__current_stage_due_date on jobs  (cost=0.56..98661.05 rows=26541 width=2518) (actual time=0.032..63.266 rows=483 loops=1)
  Index Cond: (status = ANY ('{2,3,4,291,290,46,142,260,6,7,270}'::integer[]))
Planning Time: 0.356 ms
Execution Time: 63.337 ms

Solution

  • The main problem is that the planner estimates it will find 1563473 rows but actually finds 303. If there really were 1563473 rows to find, the hash join over the seq scan probably would be faster than the ANY-driven index scan.

    Unfortunately there probably isn't anything you can do about that with your current data model and the existing versions of PostgreSQL. It looks like the stuckable statuses are substantially rarer in the jobs table than the non-stuckable ones, but the planner has no way to know that.

    To force the faster plan, you could temporarily turn off either enable_hashjoin or enable_seqscan just before running this query. That is definitely an ugly solution, but it should be a reliable one. Turning off parallel query (set max_parallel_workers_per_gather = 0) might also be enough to switch to the faster plan; if you don't get much benefit from parallel query anyway, this would be a less ugly solution, but also a less reliable one. Or you could try adding planner hints, e.g. via the pg_hint_plan extension, since PostgreSQL has no built-in hint syntax.
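    As a sketch of the settings approach: SET LOCAL keeps the change scoped to one transaction, so nothing else running in the session is affected. Pick whichever of the settings below turns out to flip the plan.

      BEGIN;
      -- Discourage the seq-scan-driven hash join for just this query.
      SET LOCAL enable_seqscan = off;
      -- Alternatively: SET LOCAL enable_hashjoin = off;
      -- Or try merely disabling parallelism first:
      -- SET LOCAL max_parallel_workers_per_gather = 0;

      WITH stuckable_statuses AS (
        SELECT status_id FROM status_descriptions
        WHERE (tags @> ARRAY['stuckable']::varchar[])
      )
      SELECT jobs.* FROM jobs
      WHERE jobs.status = ANY(SELECT status_id FROM stuckable_statuses);

      COMMIT;  -- SET LOCAL reverts automatically at transaction end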

    The most robust solution is probably just to run it as two queries: fetch the array/list of status_id in the first, then stuff that array/list into the second. That way the planner can actually see the values which are going to be used, and can plan accordingly. (I notice that your hard-coded plan is still substantially misestimated, but not by nearly as much as the other one, and not by enough to drive the plan choice. That is likely a separate issue, unrelated to the one you currently face.)
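    The two-query approach looks roughly like this (the literal array shown is just the one from your UPDATE; in practice it would be whatever the first query returned, passed in from the application layer, ideally as a bind parameter):

      -- Query 1: fetch the ids from the small (~200 row) lookup table.
      SELECT status_id FROM status_descriptions
      WHERE tags @> ARRAY['stuckable']::varchar[];

      -- Query 2: feed the result of query 1 in as a concrete array,
      -- so the planner can estimate against the actual values.
      SELECT jobs.* FROM jobs
      WHERE jobs.status = ANY(ARRAY[2, 3, 4, 291, 290, 46, 142, 260, 6, 7, 270]);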

    For your data model, I'd intuit that the stuckability of a given status is not something that would change very often, if ever. If you could record that value directly as a new column in the jobs table, updating it whenever the status itself changes rather than going through the indirection of a separate table, that would pretty much instantly solve this problem.
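    A minimal sketch of that denormalization (the column name stuckable and the index name are illustrative, not prescriptive; your application would have to keep the flag in sync whenever it changes a job's status):

      -- Record stuckability directly on jobs.
      ALTER TABLE jobs ADD COLUMN stuckable boolean NOT NULL DEFAULT false;

      -- One-time backfill from the lookup table.
      UPDATE jobs SET stuckable = true
      WHERE status IN (SELECT status_id FROM status_descriptions
                       WHERE tags @> ARRAY['stuckable']::varchar[]);

      -- A partial index stays tiny because stuckable rows are rare.
      CREATE INDEX index_jobs_on_stuckable ON jobs (stuckable) WHERE stuckable;

      -- The original query collapses to:
      SELECT * FROM jobs WHERE stuckable;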