Search code examples
sqlpostgresqlplpgsqlpostgresql-performance

To sort the letters in a string alphabetically in PostgreSQL


I am currently using this method to sort letters in a string alphabetically in PostgreSQL. Are there any other efficient methods?

select string_agg(c, '') as s
from   (select unnest(regexp_split_to_array('ijsAafhareDbv', '')) as c 
        order  by c) as t; 

       s   
 --------------
 ADaabefhijrsv

Solution

  • A function implemented in C is substantially faster than anything we can achieve with LANGUAGE sql or plpgsql. So your plpythonu function wins the performance contest by a long shot.

    But plpythonu is an untrusted procedural language. It's not installed by default and only superusers can create functions in untrusted languages. You need to be aware of security implications. And untrusted languages are not available at all on most cloud services.
    The current manual (quote from pg 10):

    PL/Python is only available as an “untrusted” language, meaning it does not offer any way of restricting what users can do in it and is therefore named plpythonu. A trusted variant plpython might become available in the future if a secure execution mechanism is developed in Python. The writer of a function in untrusted PL/Python must take care that the function cannot be used to do anything unwanted, since it will be able to do anything that could be done by a user logged in as the database administrator. Only superusers can create functions in untrusted languages such as plpythonu.

    The SQL functions you tested are not well optimized. There are a thousand and one ways to improve performance, yet:

    Demo

    -- func to create random strings
    CREATE OR REPLACE FUNCTION f_random_string(int)
      RETURNS text AS
    $func$
    SELECT array_to_string(ARRAY(
       SELECT substr('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', (ceil(random()*62))::int, 1)
       FROM generate_series(1, $1)
       ), '')
    $func$  LANGUAGE sql VOLATILE;
    
    -- test tbl with 100K rows
    CREATE TABLE tbl(str text);
    INSERT INTO tbl
    SELECT f_random_string(15)
    FROM   generate_series(1, 100000) g;
    
    VACUUM ANALYZE tbl;
    
    -- 1: your test function 1 (inefficient)
    CREATE OR REPLACE FUNCTION sort1(text)  RETURNS text AS
    $func$  -- your test function 1 (very inefficient)
    SELECT string_agg(c, '')
    FROM  (SELECT unnest(regexp_split_to_array($1, '')) AS c ORDER  BY c) t;
    $func$ LANGUAGE sql IMMUTABLE;
    
    -- 2: your test function 2 ( inefficient)
    CREATE OR REPLACE FUNCTION sort2(text)  RETURNS text AS
    $func$
    WITH t(s) AS (VALUES ($1))
    SELECT string_agg(substr(t.s, g.g, 1), '' ORDER BY substr(t.s, g.g, 1))
    FROM   t
    CROSS  JOIN LATERAL generate_series(1, length(t.s)) g;
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 3: remove pointless CTE from sort2
    CREATE OR REPLACE FUNCTION sort3(text)  RETURNS text AS
    $func$
    SELECT string_agg(substr($1, g, 1), '' ORDER BY substr($1, g, 1))
    FROM   generate_series(1, length($1)) g;
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 4: use unnest instead of calling substr N times
    CREATE OR REPLACE FUNCTION sort4(text)  RETURNS text AS
    $func$
    SELECT string_agg(c, '' ORDER BY c)
    FROM   unnest(string_to_array($1, NULL)) c
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 5: ORDER BY in subquery
    CREATE OR REPLACE FUNCTION sort5(text)  RETURNS text AS
    $func$
    SELECT string_agg(c, '')
    FROM  (
       SELECT c
       FROM   unnest(string_to_array($1, NULL)) c
       ORDER  BY c
       ) sub
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 6: SRF in SELECT list
    CREATE OR REPLACE FUNCTION sort6(text)  RETURNS text AS
    $func$
    SELECT string_agg(c, '')
    FROM  (SELECT unnest(string_to_array($1, NULL)) c ORDER BY 1) sub
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 7: ARRAY constructor instead of aggregate func
    CREATE OR REPLACE FUNCTION sort7(text)  RETURNS text AS
    $func$
    SELECT array_to_string(ARRAY(SELECT unnest(string_to_array($1, NULL)) c ORDER BY c), '')
    $func$  LANGUAGE sql IMMUTABLE;
    
    -- 8: The same with COLLATE "C"
    CREATE OR REPLACE FUNCTION sort8(text)  RETURNS text AS
    $func$
    SELECT array_to_string(ARRAY(SELECT unnest(string_to_array($1 COLLATE "C", NULL)) c ORDER BY c), '')
    $func$  LANGUAGE sql IMMUTABLE;
    
    SELECT str, sort1(str), sort2(str), sort3(str), sort4(str), sort5(str), sort6(str), sort7(str), sort8(str) FROM tbl LIMIT 1;  -- result sample 
    
    str             | sort1           | sort2           | sort3           | sort4           | sort5           | sort6           | sort7           | sort8          
    :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :-------------- | :--------------
    tUkmori4D1rHhI1 | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DhHiIkmorrtU | 114DHIUhikmorrt
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort1(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.053 ms                                                                  |
    | Execution time: 2742.904 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort2(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.105 ms                                                                  |
    | Execution time: 2579.397 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort3(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.079 ms                                                                  |
    | Execution time: 2191.228 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort4(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.075 ms                                                                  |
    | Execution time: 2194.780 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort5(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.083 ms                                                                  |
    | Execution time: 1902.829 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort6(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.075 ms                                                                  |
    | Execution time: 1866.407 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort7(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.067 ms                                                                  |
    | Execution time: 1863.713 ms                                                              |
    
    EXPLAIN (ANALYZE, TIMING OFF) SELECT sort8(str) FROM tbl;
    
    | QUERY PLAN                                                                               |
    | :--------------------------------------------------------------------------------------- |
    | Seq Scan on tbl  (cost=0.00..26541.00 rows=100000 width=32) (actual rows=100000 loops=1) |
    | Planning time: 0.074 ms                                                                  |
    | Execution time: 1569.376 ms                                                              |
    

    db<>fiddle here

    The last one sorts without COLLATION rules, that's strictly by byte values of characters, which is substantially cheaper. But you may or may not need the sort order for a different locale.

    The manual about COLLATION expressions.