Tags: sql · postgresql · performance · postgresql-performance

postgres large table select optimization


I have to extract a DB to an external DB server for a licensed application. The DB has to be Postgres, and I cannot change the SELECT query issued by the application (no access to its source code).

The table (it has to be a single table) holds around 6.5M rows and has unique values in the main column (prefix).

All requests are reads (no inserts/updates/deletes), and there are ~200k selects/day with peaks of 15 TPS.

The select query is:

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table 
WHERE '00436641997142' LIKE prefix 
AND company = 0  and ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )  
ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC 
LIMIT 1;

EXPLAIN ANALYZE shows the following:

Limit  (cost=406433.75..406433.75 rows=1 width=113) (actual time=1721.360..1721.361 rows=1 loops=1)
  ->  Sort  (cost=406433.75..406436.72 rows=1188 width=113) (actual time=1721.358..1721.358 rows=1 loops=1)
        Sort Key: ("position"((prefix)::text, '%'::text)), (char_length(prefix)) DESC
        Sort Method: quicksort  Memory: 25kB
        ->  Seq Scan on table  (cost=0.00..406427.81 rows=1188 width=113) (actual time=1621.159..1721.345 rows=1 loops=1)
              Filter: ((company = 0) AND ('00381691997142'::text ~~ (prefix)::text) AND ((strpos(("Day")::text, (to_char(now(), 'ID'::text))::text) > 0) OR ("Day" IS NULL)) AND (((('now'::cstring)::time with time zone >= (timefrom)::time with time zone) AN (...)
              Rows Removed by Filter: 6417130
Planning time: 0.165 ms
Execution time: 1721.404 ms

The slowest part of the query is:

 SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table 
 WHERE '00436641997142' LIKE prefix 

which alone takes ~1.6 s (I tested only this part of the query).

Plan for this part tested separately:

Seq Scan on table  (cost=0.00..181819.07 rows=32086 width=113) (actual time=1488.359..1580.607 rows=1 loops=1)
  Filter: ('004366491997142'::text ~~ (prefix)::text)
  Rows Removed by Filter: 6417130
Planning time: 0.061 ms
Execution time: 1580.637 ms

About the data itself: values in column "prefix" share the same first few digits (the first 5); the rest differ and are unique.

The Postgres version is 9.5. I've changed the following Postgres settings:

random_page_cost = 40
effective_cache_size = 4GB
shared_buffers = 4GB
work_mem = 1GB

I have tried several index types (unique, GIN, GiST, hash), but in all cases the indexes are not used (as shown in the plan above) and the speed is the same. I've also run the following, with no visible improvement:

vacuum analyze verbose table

Please recommend DB settings and/or an index configuration to speed up this query.

Current HW is an i5, SSD, 16GB RAM on Win7, but I have the option to buy stronger HW. As I understand it, for workloads where reads dominate (no inserts/updates), faster CPU cores matter much more than core count or disk speed; please confirm.

Add-on 1: After adding 9 indexes, none of them is used either.

Add-on 2: 1) I found the reason the index is not used: the operand order in the LIKE part of the query. If the query were:

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE prefix like '00436641997142%'
AND company = 0  and 
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )
 ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1

it would use the index.

Notice the difference. The query which does not use the index:

... WHERE '00436641997142' like prefix ...

The query which uses the index correctly:

... WHERE prefix like '00436641997142%' ...

Since I cannot change the query itself, any idea how to overcome this? I can change the data and Postgres settings, but not the query.

2) Also, I installed Postgres 9.6 in order to use parallel seq scan. In this case, a parallel scan is used only if the last part of the query is omitted. So, the query:

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM table WHERE '00436641997142' LIKE prefix 
AND company = 0  and 
((current_time between timefrom and timeto) or (timefrom is null and timeto is null))
 ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1

uses parallel mode.

Any idea how to force the original query (which I cannot change):

SELECT prefix, changeprefix, deletelast, outgroup, tariff FROM erm_table WHERE '00436641997142' LIKE prefix 
AND company = 0  and 
((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )
 ORDER BY position('%' in prefix) ASC, char_length(prefix) DESC LIMIT 1

to use parallel seq. scan?


Solution

  • It's hard to build an index for queries of the form string LIKE pattern because the wildcards (% and _) can appear anywhere in the pattern.

    I can suggest one risky solution:

    1. Slightly redesign the table to make it indexable. Add two more columns, prefix_low and prefix_high, of fixed width (for example char(32), or any length sufficient for the task). Also add one smallint column for the prefix length. Fill them with the lowest and highest values matching the prefix, and with the prefix length. For example:

      select rpad(rtrim('00436641997142%','%'), 32, '0') AS prefix_low, rpad(rtrim('00436641997142%','%'), 32, '9') AS prefix_high, length(rtrim('00436641997142%','%')) AS prefix_length;
      
                prefix_low            |            prefix_high           | prefix_length
      ----------------------------------+----------------------------------+---------------
       00436641997142000000000000000000 | 00436641997142999999999999999999 |            14
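
      The same computation, sketched in Python for clarity (`prefix_bounds` is a hypothetical helper mirroring the rtrim/rpad expressions above):

```python
def prefix_bounds(pattern, width=32):
    """Lowest/highest fixed-width strings covered by a trailing-% prefix
    pattern, mirroring the SQL rtrim/rpad expressions above."""
    stem = pattern.rstrip('%')        # rtrim(pattern, '%')
    low = stem.ljust(width, '0')      # rpad(stem, 32, '0')
    high = stem.ljust(width, '9')     # rpad(stem, 32, '9')
    return low, high, len(stem)

low, high, n = prefix_bounds('00436641997142%')
# low  is '00436641997142' followed by eighteen '0's,
# high is '00436641997142' followed by eighteen '9's, n is 14
```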
      
    2. Create an index on these columns:

      CREATE INDEX table_prefix_low_high_idx ON table (prefix_low, prefix_high);
      
    3. Check the modified query against the table:

      SELECT prefix, changeprefix, deletelast, outgroup, tariff 
      FROM table 
      WHERE '00436641997142%' BETWEEN prefix_low AND prefix_high
        AND company = 0  
        AND ((current_time between timefrom and timeto) or (timefrom is null and timeto is null)) and (strpos("Day", cast(to_char(now(), 'ID') as varchar)) > 0  or "Day" is null )
      ORDER BY prefix_length DESC 
      LIMIT 1
      

      Check how well it works with the indexes and try to tune it: add/remove an index on prefix_length, add it to the BETWEEN index, and so on.

    4. Now you need to rewrite the queries sent to the database. Install PgBouncer with the pgbouncer-rr patch. It lets you rewrite queries on the fly with simple Python code, for example:

      import re
      
      def rewrite_query(username, query):
          # Hypothetical pattern for the app's query shape; adapt to the real SQL.
          q1 = r"^SELECT .* '(?P<id>\d+)' LIKE prefix .* ORDER BY position\('%' in prefix\) ASC, char_length\(prefix\) DESC LIMIT 1"
          m = re.match(q1, query)
          if not m:
              return query  # nothing to do with other queries
          # ... build the rewritten query here, using m.group('id')
          new_query = query  # placeholder until the rewrite rule is written
          return new_query
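
      For illustration, a hedged sketch of one concrete rewrite rule that turns the `'<number>' LIKE prefix` predicate into the BETWEEN form from step 3. The regex and the replacement are assumptions to be adapted to the exact queries your application issues, and the ORDER BY clause would need a similar substitution (to ORDER BY prefix_length DESC):

```python
import re

# Assumed shape of the app's predicate; verify against real traffic.
LIKE_PRED = re.compile(r"'(?P<num>\d+)'\s+LIKE\s+prefix", re.IGNORECASE)

def rewrite_query(username, query):
    m = LIKE_PRED.search(query)
    if not m:
        return query  # pass all other queries through unchanged
    # Replace "'<num>' LIKE prefix" with the indexable BETWEEN form.
    between = f"'{m.group('num')}%' BETWEEN prefix_low AND prefix_high"
    return LIKE_PRED.sub(between, query, count=1)
```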
      
    5. Run PgBouncer and connect it to the DB. Issue the same kinds of queries your application does and check how they get rewritten. Because you are dealing with text, you will have to tweak the regexps to match all required queries and rewrite them properly.

    6. When the proxy is ready and debugged, reconnect your application to PgBouncer.

    Pro:

    • no changes to application
    • no changes to basic structure of DB

    Contra:

    • extra maintenance: you need triggers to keep the new columns up to date
    • extra tools to support
    • the rewrite uses regexps, so it's closely tied to the particular queries issued by your app; you need to run it for some time to make the rewrite rules robust

    Further development: hijack the parsed query tree in PostgreSQL itself: https://wiki.postgresql.org/wiki/Query_Parsing