sql database postgresql count postgresql-performance

What is wrong with the way I count rows in a complex query?

I have a database with a few tables, each has a few millions rows (tables do have indexes). I need to count rows in a table, but only those whose foreign key field points to a subset from another table.
Here is the query:

WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  ( top.production_year >= 1982 
                  AND top.production_year <= 1984 
                  AND top.kind_id IN( 1, 2 ) 
                   OR EXISTS(SELECT 1 
                             FROM   title sub 
                             WHERE  sub.episode_of_id = top.id 
                                    AND sub.production_year >= 1982 
                                    AND sub.production_year <= 1984 
                                    AND sub.kind_id IN( 1, 2 )) )) 
SELECT Count(*) 
FROM   cast_info 
WHERE  EXISTS(SELECT 1 
              FROM   filtered_title 
              WHERE  cast_info.movie_id = filtered_title.id) 
       AND cast_info.role_id IN( 3, 8 )

I use CTE because there are more COUNT queries down there for other tables, which use the same subqueries. But I have tried to get rid of CTE and the results were the same: the first time I execute the query it runs... runs... and runs for more than ten minutes. The second time I execute the query, it's down to 4 seconds, which is acceptable for me.

The result of EXPLAIN ANALYZE:

Aggregate  (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
  CTE filtered_title
    ->  Seq Scan on title top  (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
          Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
          Rows Removed by Filter: 2832906
          SubPlan 1
            ->  Index Scan using title_idx_epof on title sub  (cost=0.43..16.16 rows=1 width=0) (never executed)
                  Index Cond: (episode_of_id = top.id)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
          SubPlan 2
            ->  Seq Scan on title sub_1  (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
                  Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
                  Rows Removed by Filter: 2832906
  ->  Nested Loop  (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
        ->  HashAggregate  (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
              ->  CTE Scan on filtered_title  (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
        ->  Index Scan using cast_info_idx_mid on cast_info  (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
              Index Cond: (movie_id = filtered_title.id)
              Filter: (role_id = ANY ('{3,8}'::integer[]))
              Rows Removed by Filter: 15
Total runtime: 127729.100 ms

Now to my question. What am I doing wrong and how can I fix it?

I tried a few variants of the same query: exclusive joins, joins/exists. On one hand this one seems to require the least time to do the job (10x faster), but it's still 60 seconds on average. On the other hand, unlike my first query which needs 4-6 seconds on the second run, it always requires 60 seconds.

WITH filtered_title 
     AS (SELECT top.id 
         FROM   title top 
         WHERE  top.production_year >= 1982 
                AND top.production_year <= 1984 
                AND top.kind_id IN( 1, 2 ) 
                 OR EXISTS(SELECT 1 
                           FROM   title sub 
                           WHERE  sub.episode_of_id = top.id 
                                  AND sub.production_year >= 1982 
                                  AND sub.production_year <= 1984 
                                  AND sub.kind_id IN( 1, 2 ))) 
SELECT Count(*) 
FROM   cast_info 
       join filtered_title 
         ON cast_info.movie_id = filtered_title.id 
WHERE  cast_info.role_id IN( 3, 8 )

Solution

^{Disclaimer: There are way too many factors in play to give a conclusive answer. The information with a few tables, each has a few millions rows (tables do have indexes) just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most important) indexes. And a proper basic configuration of your db server, of course. All of this goes beyond the scope of a single question on SO. Start with the links in the postgresql-performance tag. Or hire a professional.}

I am going to address the most prominent detail (for me) in your query plan:

Sequential scan on `title`?

-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906

Bold emphasis mine. Sequentially scanning 3 million rows to retrieve just 16250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer. Subsequent calls can read data from the cache. Since the table is big, data probably won't stay in the cache for long unless you have huge amounts of cache.

An index scan is typically substantially faster to gather 0.5 % of the rows from a big table. Possible causes:

My money is on the index. You didn't supply your version of Postgres, so assuming current 9.3. The perfect index for this query would be:

CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)

Data types matter. Sequence of columns in the index matters.
kind_id first, because the rule of thumb is: index for equality first — then for ranges.
The last two columns (id, episode_of_id) are only useful for a potential index-only scan. If not applicable, drop those. More details here:
PostgreSQL composite primary key

The way you built your query, you end up with two sequential scans on the big table. So here's an educated guess for a ...

Better query

WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984 
   )
, t_all AS (
   SELECT id FROM t_base

   UNION  -- not UNION ALL (!)
   SELECT id
   FROM  (SELECT DISTINCT episode_of_id AS id FROM t_base) x
   JOIN   title t USING (id)
   )
SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_all t ON t.id = c.movie_id 
WHERE  c.role_id IN (3, 8);

This should give you one index scan on the new title_foo_idx, plus another index scan on the pk index of title. The rest should be relatively cheap. With any luck, much faster than before.

kind_id BETWEEN 1 AND 2 .. as long as you have a continuous range of values, that is faster than listing individual values because this way Postgres can fetch a continuous range from the index. Not very important for just two values.

Test this alternative for the second leg of t_all. Not sure which is faster:

   SELECT id
   FROM   title t 
   WHERE  EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)

Temporary table instead of CTE

You write:

I use CTE because there are more COUNT queries down there for other tables, which use the same subqueries.

A CTE poses as optimization barrier, the resulting internal work table is not indexed. When reusing a result (with more than a trivial number of rows) multiple times, it pays to use an indexed temp table instead. Index creation for a simple int column is fast.

CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
   SELECT id, episode_of_id
   FROM   title
   WHERE  kind_id BETWEEN 1 AND 2
   AND    production_year BETWEEN 1982 AND 1984 
   )
SELECT id FROM t_base
UNION
SELECT id FROM title t 
WHERE  EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);

ANALYZE t_tmp;                       -- !
CREATE UNIQUE INDEX ON t_tmp (id);   -- ! (unique is optional)

SELECT count(*) AS ct
FROM   cast_info c
JOIN   t_tmp t ON t.id = c.movie_id 
WHERE  c.role_id IN (3, 8);

-- More queries using t_tmp

About temp tables:
How to tell if record has changed in Postgres

What is wrong with the way I count rows in a complex query?

Sequential scan on title?

Better query

Temporary table instead of CTE

Sequential scan on `title`?