I am having an issue with the following PostgreSQL query: it takes more than 10 seconds to run. Is there any way to speed it up to a reasonable speed? I am simply looking for the most relevant search terms associated with videos in a very large database.
SELECT count(*), videoid
FROM term_search
WHERE word = 'tester'
   OR word = 'question'
   OR word = 'one'
GROUP BY videoid
ORDER BY count(*) DESC
LIMIT 1800;
When the query is run with EXPLAIN ANALYZE, the resulting query plan is as follows (http://explain.depesz.com/s/yDJ):
Limit  (cost=389625.50..389630.00 rows=1800 width=4) (actual time=11766.693..11770.001 rows=1800 loops=1)
  Output: (count(*)), videoid
  ->  Sort  (cost=389625.50..389692.68 rows=26873 width=4) (actual time=11766.689..11767.818 rows=1800 loops=1)
        Output: (count(*)), videoid
        Sort Key: (count(*))
        Sort Method: top-N heapsort  Memory: 181kB
        ->  HashAggregate  (cost=387769.41..388038.14 rows=26873 width=4) (actual time=9215.653..10641.993 rows=1632578 loops=1)
              Output: count(*), videoid
              ->  Bitmap Heap Scan on public.term_search  (cost=44915.83..378163.38 rows=1921207 width=4) (actual time=312.449..7026.036 rows=2047691 loops=1)
                    Output: id, videoid, word, termindex, weight
                    Recheck Cond: (((term_search.word)::text = 'tester'::text) OR ((term_search.word)::text = 'question'::text) OR ((term_search.word)::text = 'one'::text))
                    Rows Removed by Index Recheck: 25512434
                    ->  BitmapOr  (cost=44915.83..44915.83 rows=1950031 width=0) (actual time=288.937..288.937 rows=0 loops=1)
                          ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..8552.83 rows=383502 width=0) (actual time=89.266..89.266 rows=419750 loops=1)
                                Index Cond: ((term_search.word)::text = 'tester'::text)
                          ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..13171.84 rows=590836 width=0) (actual time=89.700..89.700 rows=604348 loops=1)
                                Index Cond: ((term_search.word)::text = 'question'::text)
                          ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..21750.26 rows=975693 width=0) (actual time=109.964..109.964 rows=1023593 loops=1)
                                Index Cond: ((term_search.word)::text = 'one'::text)
The schema for the table is as follows:
Column | Type | Modifiers | Storage | Description
-----------+------------------------+----------------------------------------------------------+----------+-------------
id | integer | not null default nextval('term_search_id_seq'::regclass) | plain |
videoid | integer | | plain |
word | character varying(100) | | extended |
termindex | character varying(15) | | extended |
weight | smallint | | plain |
Indexes:
    "term_search_pkey" PRIMARY KEY, btree (id)
    "search_term_exists_idx" btree (videoid, word)
    "terms_caverphone_idx" btree (termindex)
    "terms_video_idx" btree (videoid)
    "terms_word_idx" btree (word, videoid)
Foreign-key constraints:
    "term_search_videoid_fkey" FOREIGN KEY (videoid) REFERENCES videos(id) ON DELETE CASCADE
Has OIDs: no
I have managed to get it down to 7 seconds with index-only scans, but that is still not fast enough. I am running PostgreSQL 9.3 on Ubuntu 14.04 on an AWS r3.xlarge instance, with approximately 50 million rows in the table. Any advice is greatly appreciated!
EDIT:
Attached is the result of SELECT schemaname, tablename, attname, null_frac, avg_width, n_distinct FROM pg_stats WHERE schemaname = 'public' AND tablename = 'term_search';
schemaname | tablename | attname | null_frac | avg_width | n_distinct
------------+-------------+-----------+-----------+-----------+------------
public | term_search | id | 0 | 4 | -1
public | term_search | videoid | 0 | 4 | 568632
public | term_search | word | 0 | 6 | 5054
public | term_search | termindex | 0 | 11 | 2485
public | term_search | weight | 0 | 2 | 3
If I had a chance to disconnect users for a night, I would redesign the schema: move the distinct words out of term_search into a separate words table and reference them from term_search through a word_id foreign key, something like this:
create table words (
word_id serial primary key,
word text);
insert into words (word)
select distinct word
from term_search;
alter table term_search add column word_id integer;
update term_search t
set word_id = w.word_id
from words w
where t.word = w.word;
alter table term_search add constraint term_search_word_fkey
foreign key (word_id) references words (word_id);
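Once the word column is gone, terms_word_idx on (word, videoid) no longer helps, so the new word_id column needs its own index; a sketch, assuming the names above (the index names are hypothetical):

```sql
-- hypothetical follow-up: index the new foreign key so the join and the
-- per-video aggregation can be served from the index
create index term_search_word_id_idx on term_search (word_id, videoid);

-- word lookups now go through the small words table; the rewritten query
-- (and any insert logic) benefits from a unique index on words.word
create unique index words_word_idx on words (word);

analyze term_search;
analyze words;
```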
Test:
SELECT count(*), videoid
FROM term_search t
JOIN words w on t.word_id = w.word_id
WHERE w.word = 'tester'
OR w.word = 'question'
OR w.word = 'one'
GROUP BY videoid
ORDER BY count(*) DESC
LIMIT 1800;
-- if that was faster, then:
alter table term_search drop column word;
-- and on the fly...
alter table term_search alter termindex type text;
After the revolution I'd have to take care of inserts and updates on term_search. I'd probably create a view with rules for insert and update.
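A minimal sketch of such a view, assuming the table and column names above (the view and rule names are hypothetical, and on 9.3 an INSTEAD OF trigger would do the job as well):

```sql
-- hypothetical sketch: a view that presents the old shape of term_search
create view term_search_v as
select t.id, t.videoid, w.word, t.termindex, t.weight
from term_search t
join words w on w.word_id = t.word_id;

-- rewrite inserts against the view into the two underlying tables:
-- add the word if it is new, then reference it by word_id
create rule term_search_v_insert as
on insert to term_search_v do instead (
    insert into words (word)
    select new.word
    where not exists (select 1 from words where word = new.word);

    insert into term_search (videoid, word_id, termindex, weight)
    select new.videoid, w.word_id, new.termindex, new.weight
    from words w
    where w.word = new.word
);
```

An update rule would follow the same pattern; under concurrent writers a trigger with proper conflict handling would be safer than the rule above.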