I have a database with a few tables, each has a few millions rows (tables do have indexes). I need to count rows in a table, but only those whose foreign key field points to a subset from another table.
Here is the query:
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE ( top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )) ))
SELECT Count(*)
FROM cast_info
WHERE EXISTS(SELECT 1
FROM filtered_title
WHERE cast_info.movie_id = filtered_title.id)
AND cast_info.role_id IN( 3, 8 )
I use CTE because there are more COUNT queries down there for other tables, which use the same subqueries. But I have tried to get rid of CTE and the results were the same: the first time I execute the query it runs... runs... and runs for more than ten minutes. The second time I execute the query, it's down to 4 seconds, which is acceptable for me.
The result of EXPLAIN ANALYZE
:
Aggregate (cost=46194894.49..46194894.50 rows=1 width=0) (actual time=127728.452..127728.452 rows=1 loops=1)
CTE filtered_title
-> Seq Scan on title top (cost=0.00..46123542.41 rows=1430406 width=4) (actual time=732.509..1596.345 rows=16250 loops=1)
Filter: (((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[]))) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
Rows Removed by Filter: 2832906
SubPlan 1
-> Index Scan using title_idx_epof on title sub (cost=0.43..16.16 rows=1 width=0) (never executed)
Index Cond: (episode_of_id = top.id)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
SubPlan 2
-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
-> Nested Loop (cost=32184.70..63158.16 rows=3277568 width=0) (actual time=1620.382..127719.030 rows=29679 loops=1)
-> HashAggregate (cost=32184.13..32186.13 rows=200 width=4) (actual time=1620.058..1631.697 rows=16250 loops=1)
-> CTE Scan on filtered_title (cost=0.00..28608.12 rows=1430406 width=4) (actual time=732.513..1607.093 rows=16250 loops=1)
-> Index Scan using cast_info_idx_mid on cast_info (cost=0.56..154.80 rows=6 width=4) (actual time=5.977..7.758 rows=2 loops=16250)
Index Cond: (movie_id = filtered_title.id)
Filter: (role_id = ANY ('{3,8}'::integer[]))
Rows Removed by Filter: 15
Total runtime: 127729.100 ms
Now to my question. What am I doing wrong and how can I fix it?
I tried a few variants of the same query: exclusive joins, joins/exists. On one hand this one seems to require the least time to do the job (10x faster), but it's still 60 seconds on average. On the other hand, unlike my first query which needs 4-6 seconds on the second run, it always requires 60 seconds.
WITH filtered_title
AS (SELECT top.id
FROM title top
WHERE top.production_year >= 1982
AND top.production_year <= 1984
AND top.kind_id IN( 1, 2 )
OR EXISTS(SELECT 1
FROM title sub
WHERE sub.episode_of_id = top.id
AND sub.production_year >= 1982
AND sub.production_year <= 1984
AND sub.kind_id IN( 1, 2 )))
SELECT Count(*)
FROM cast_info
join filtered_title
ON cast_info.movie_id = filtered_title.id
WHERE cast_info.role_id IN( 3, 8 )
Disclaimer: There are way too many factors in play to give a conclusive answer. The information with a few tables, each has a few millions rows (tables do have indexes)
just doesn't cut it. It depends on cardinalities, table definitions, data types, usage patterns and (probably most important) indexes. And a proper basic configuration of your db server, of course. All of this goes beyond the scope of a single question on SO. Start with the links in the postgresql-performance tag. Or hire a professional.
I am going to address the most prominent detail (for me) in your query plan:
title
?-> Seq Scan on title sub_1 (cost=0.00..90471.23 rows=11657 width=4) (actual time=0.071..730.311 rows=16250 loops=1)
Filter: ((production_year >= 1982) AND (production_year <= 1984) AND (kind_id = ANY ('{1,2}'::integer[])))
Rows Removed by Filter: 2832906
Bold emphasis mine. Sequentially scanning 3 million rows to retrieve just 16250 is not very efficient. The sequential scan is also the likely reason why the first run takes so much longer. Subsequent calls can read data from the cache. Since the table is big, data probably won't stay in the cache for long unless you have huge amounts of cache.
An index scan is typically substantially faster to gather 0.5 % of the rows from a big table. Possible causes:
No matching index.
My money is on the index. You didn't supply your version of Postgres, so assuming current 9.3. The perfect index for this query would be:
CREATE INDEX title_foo_idx ON title (kind_id, production_year, id, episode_of_id)
Data types matter. Sequence of columns in the index matters.
kind_id
first, because the rule of thumb is: index for equality first — then for ranges.
The last two columns (id, episode_of_id
) are only useful for a potential index-only scan. If not applicable, drop those. More details here:
PostgreSQL composite primary key
The way you built your query, you end up with two sequential scans on the big table. So here's an educated guess for a ...
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
, t_all AS (
SELECT id FROM t_base
UNION -- not UNION ALL (!)
SELECT id
FROM (SELECT DISTINCT episode_of_id AS id FROM t_base) x
JOIN title t USING (id)
)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_all t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
This should give you one index scan on the new title_foo_idx
, plus another index scan on the pk index of title
. The rest should be relatively cheap. With any luck, much faster than before.
kind_id BETWEEN 1 AND 2
.. as long as you have a continuous range of values, that is faster than listing individual values because this way Postgres can fetch a continuous range from the index. Not very important for just two values.
Test this alternative for the second leg of t_all
. Not sure which is faster:
SELECT id
FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id)
You write:
I use CTE because there are more COUNT queries down there for other tables, which use the same subqueries.
A CTE poses as optimization barrier, the resulting internal work table is not indexed. When reusing a result (with more than a trivial number of rows) multiple times, it pays to use an indexed temp table instead. Index creation for a simple int column is fast.
CREATE TEMP TABLE t_tmp AS
WITH t_base AS (
SELECT id, episode_of_id
FROM title
WHERE kind_id BETWEEN 1 AND 2
AND production_year BETWEEN 1982 AND 1984
)
SELECT id FROM t_base
UNION
SELECT id FROM title t
WHERE EXISTS (SELECT 1 FROM t_base WHERE t_base.episode_of_id = t.id);
ANALYZE t_tmp; -- !
CREATE UNIQUE INDEX ON t_tmp (id); -- ! (unique is optional)
SELECT count(*) AS ct
FROM cast_info c
JOIN t_tmp t ON t.id = c.movie_id
WHERE c.role_id IN (3, 8);
-- More queries using t_tmp
About temp tables:
How to tell if record has changed in Postgres