
Postgres GROUP BY on integer-type columns faster than on character-type columns?


I have 4 tables:

create table web_content_3 ( content integer, hits bigint, bytes bigint, appid varchar(32) );
create table web_content_4 ( content character varying(128), hits bigint, bytes bigint, appid varchar(32) );
create table web_content_5 ( content character varying(128), hits bigint, bytes bigint, appid integer );
create table web_content_6 ( content integer, hits bigint, bytes bigint, appid integer );

I am running the same GROUP BY query over approximately 2 million records in each table, i.e. SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid FROM web_content_{3,4,5,6} GROUP BY content, appid; The results are:

Table Name    | content   | appid     | Time Taken [ms]
==============|===========|===========|================
web_content_3 | integer   | character | 27277.931
web_content_4 | character | character | 151219.388
web_content_5 | character | integer   | 127252.023
web_content_6 | integer   | integer   | 5412.096

Here the web_content_6 query takes only around 5 seconds, compared to the other three combinations. From these statistics we can say that the integer/integer combination for GROUP BY is much faster, but the question is: WHY?

I have the EXPLAIN ANALYSE results as well, but they don't give me any explanation for the drastic difference between the web_content_4 and web_content_6 queries.

Here they are.

test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_4 GROUP BY content,appid;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=482173.36..507552.31 rows=17680 width=63) (actual time=138099.612..151565.655 rows=17680 loops=1)
   ->  Sort  (cost=482173.36..487196.11 rows=2009100 width=63) (actual time=138099.202..149256.707 rows=2009100 loops=1)
         Sort Key: content, appid
         Sort Method:  external merge  Disk: 152488kB
         ->  Seq Scan on web_content_4  (cost=0.00..45218.00 rows=2009100 width=63) (actual time=0.010..349.144 rows=2009100 loops=1)
 Total runtime: 151613.569 ms
(6 rows)

Time: 151614.106 ms

test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_6 GROUP BY content,appid;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=368814.36..394194.51 rows=17760 width=24) (actual time=3282.333..5840.953 rows=17760 loops=1)
   ->  Sort  (cost=368814.36..373837.11 rows=2009100 width=24) (actual time=3282.176..3946.025 rows=2009100 loops=1)
         Sort Key: content, appid
         Sort Method:  external merge  Disk: 74632kB
         ->  Seq Scan on web_content_6  (cost=0.00..34864.00 rows=2009100 width=24) (actual time=0.011..297.235 rows=2009100 loops=1)
 Total runtime: 6172.960 ms

Solution

  • Gordon Linoff is right, of course. Spilling over to disk is expensive.

    If you can spare the memory, you can tell PostgreSQL to use more for sorting and such. I built a table, populated it with random data, and analyzed it before running this query.
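
    Something along these lines would reproduce that setup (the random-data expressions here are just illustrative placeholders, not the exact script I used):

    -- assumes web_content_4 is defined as in the question
    insert into web_content_4 (content, hits, bytes, appid)
    select
        md5(random()::text),               -- random 32-character text for content
        (random() * 1000)::bigint,         -- arbitrary hit count
        (random() * 1000000)::bigint,      -- arbitrary byte count
        'app' || (random() * 100)::int     -- roughly 100 distinct appid values
    from generate_series(1, 2000000);

    analyze web_content_4;                 -- refresh the planner statistics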

    EXPLAIN ANALYSE 
    SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid 
    from web_content_4 
    GROUP BY content,appid;
    
    "GroupAggregate  (cost=364323.43..398360.86 rows=903791 width=96) (actual time=25059.086..29789.234 rows=1998067 loops=1)"
    "  ->  Sort  (cost=364323.43..369323.34 rows=1999961 width=96) (actual time=25057.540..27907.143 rows=2000000 loops=1)"
    "        Sort Key: content, appid"
    "        Sort Method: external merge  Disk: 216016kB"
    "        ->  Seq Scan on web_content_4  (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.010..475.187 rows=2000000 loops=1)"
    "Total runtime: 30012.427 ms"
    

    I get the same execution plan you did. In my case, this query does an external merge sort that requires about 216MB of disk. I can tell PostgreSQL to allow more memory for this query by setting the value of work_mem. (Setting work_mem this way affects only my current connection.)

    set work_mem = '250MB';
    EXPLAIN ANALYSE 
    SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid 
    from web_content_4 
    GROUP BY content,appid;
    
    "HashAggregate  (cost=72472.22..81510.13 rows=903791 width=96) (actual time=3196.777..4505.290 rows=1998067 loops=1)"
    "  ->  Seq Scan on web_content_4  (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.019..437.252 rows=2000000 loops=1)"
    "Total runtime: 4726.401 ms"
    

    Now PostgreSQL is using a hash aggregate, and execution time dropped by a factor of about 6, from 30 seconds to 5 seconds.
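
    Since the setting only applies to the current connection, it can be checked and reverted with the usual commands once you're done:

    SHOW work_mem;     -- value in effect for this connection
    RESET work_mem;    -- revert to the server's configured default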


    I didn't test web_content_6, because replacing text with integers will usually require a couple of joins to recover the text. So I'm not sure we'd be comparing apples to apples there.
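
    To make that concrete, with hypothetical lookup tables like these (the contents/apps names and columns are invented for illustration), recovering the text from web_content_6 would look something like:

    create table contents ( content_id integer primary key, content_text varchar(128) );
    create table apps     ( app_id     integer primary key, app_name     varchar(32) );

    -- aggregate on the cheap integer keys first, then join to get the text back
    SELECT c.content_text, a.app_name, t.hits, t.bytes
    FROM (
        SELECT content, appid, sum(hits) AS hits, sum(bytes) AS bytes
        FROM web_content_6
        GROUP BY content, appid
    ) AS t
    JOIN contents c ON c.content_id = t.content
    JOIN apps     a ON a.app_id     = t.appid;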