postgresql full-text-search postgresql-9.6

Phrase frequency counter with FULL Text Search of PostgreSQL 9.6

I need to calculate the number of times that a phrase appears using ts_query against an indexed text field (ts_vector data type). It works but it is very slow because the table is huge. For single words I pre-calculated all the frequencies but I have no ideas for increasing the speed of a phrase search.

Edit: Thank you for your reply @jjanes.

This is my query:

SELECT substring(date_input::text,0,5) as myear, ts_headline('simple',text_input,q, 'StartSel=<b>, StopSel=</b>,MaxWords=2, MinWords=1, ShortWord=1, HighlightAll=FALSE, MaxFragments=9999, FragmentDelimiter=" ... "') as headline 
FROM 
db_test, to_tsquery('simple','united<->kingdom') as q WHERE date_input BETWEEN '2000-01-01'::DATE AND '2019-12-31'::DATE and idxfti_simple @@ q

And this is the EXPLAIN (ANALYZE, BUFFERS) output:

Nested Loop  (cost=25408.33..47901.67 rows=5509 width=64) (actual time=286.536..17133.343 rows=38127 loops=1)
Buffers: shared hit=224723
    ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.005..0.007 rows=1 loops=1)
    ->  Append  (cost=25408.33..46428.00 rows=5510 width=625) (actual time=285.372..864.868 rows=38127 loops=1)
        Buffers: shared hit=165713
        ->  Bitmap Heap Scan on db_test  (cost=25408.33..46309.01 rows=5509 width=625) (actual time=285.368..791.111 rows=38127 loops=1)
            Recheck Cond: ((idxfti_simple @@ q.q) AND (date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
            Rows Removed by Index Recheck: 136
            Heap Blocks: exact=29643
            Buffers: shared hit=165607
                ->  BitmapAnd  (cost=25408.33..25408.33 rows=5509 width=0) (actual time=278.370..278.371 rows=0 loops=1)
                Buffers: shared hit=3838
                    ->  Bitmap Index Scan on idxftisimple_idx  (cost=0.00..1989.01 rows=35869 width=0) (actual time=67.280..67.281 rows=176654 loops=1)
                        Index Cond: (idxfti_simple @@ q.q)
                        Buffers: shared hit=611
                    ->  Bitmap Index Scan on db_test_date_input_idx  (cost=0.00..23142.24 rows=1101781 width=0) (actual time=174.711..174.712 rows=1149456 loops=1)
                        Index Cond: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date))
                        Buffers: shared hit=3227
        ->  Seq Scan on test  (cost=0.00..118.98 rows=1 width=451) (actual time=0.280..0.280 rows=0 loops=1)
            Filter: ((date_input >= '2000-01-01'::date) AND (date_input <= '2019-12-31'::date) AND (idxfti_simple @@ q.q))
            Rows Removed by Filter: 742
            Buffers: shared hit=106

Planning time: 0.332 ms
Execution time: 17176.805 ms

Sorry, I can't set track_io_timing turned on. I do know that ts_headline is not recommended but I need it to calculate the number of times that a phrase appears on the same field.

Thank you in advance for your help.

Solution

Note that fetching the rows in Bitmap Heap Scan is quite fast, <0.8 seconds, and almost all the time is spent in the top-level node. That time is likely to be spent in ts_headline, reparsing the text_input document. As long as you keep using ts_headline, there isn't much you can do about this.

ts_headline doesn't directly give you what you want (frequency), so you must be doing some kind of post-processing of it. Maybe you could move to postprocessing the tsvector directly, so the document doesn't need to be reparsed.

Another option is to upgrade further, which could allow the work of ts_headline to be spread over multiple CPUs. PostgreSQL 9.6 was the first version which supported parallel query, and it was not mature enough in that version to be able to parallelize this type of thing. v10 is probably enough to get parallel query for this, but you might as well jump all the way to v12.