Search code examples
postgresqlnatural-sort

PostgreSQL Query Incorrect Sorting Order for Alphanumeric Strings


I am trying to sort a column title from the table door with the following PostgreSQL query:

SELECT title
FROM door
ORDER BY
  CASE 
    WHEN title ~ '^\d+' THEN CAST(regexp_replace(title, '[^0-9]', '', 'g') AS INTEGER)
    ELSE NULL
  END ASC,
  CASE 
    WHEN title ~ '^\d+' THEN regexp_replace(title, '\d+', '', 'g')
    ELSE title
  END ASC;

The query currently produces the following result:

"DR-1"
"DR-01"
"DR-02"
"DR-03"
"DR-04"
"DR-100"
"DR-1001"
"DR-101"
"DR-102"
"DR-104"
"Entrance-1"
"Entrance-2"
"MY-Prefix-10-Entrance-1"
"MY-Prefix-11-Entrance-1"
"MY-Prefix-19-Entrance-1"
"MY-Prefix-1-Entrance-1"
"MY-Prefix-20-Entrance-1"
"MY-Prefix-2-Entrance-1"
"MY-Prefix-3-Entrance-1"
"MY-Prefix-8-Entrance-1"
"MY-Prefix-9-Entrance-1"
"MY-Prefix-9-Entrance-2"
"MY-Prefix-9-Entrance-3"

However, I was expecting the result to be:

"DR-1"
"DR-01"
"DR-02"
"DR-03"
"DR-04"
"DR-100"
"DR-101"
"DR-102"
"DR-104"
"DR-1001"
"Entrance-1"
"Entrance-2"
"MY-Prefix-1-Entrance-1"
"MY-Prefix-2-Entrance-1"
"MY-Prefix-3-Entrance-1"
"MY-Prefix-8-Entrance-1"
"MY-Prefix-9-Entrance-1"
"MY-Prefix-9-Entrance-2"
"MY-Prefix-9-Entrance-3"
"MY-Prefix-10-Entrance-1"
"MY-Prefix-11-Entrance-1"
"MY-Prefix-19-Entrance-1"
"MY-Prefix-20-Entrance-1"

The query is not correctly ordering the rows where the title column contains mixed prefixes and numeric segments, particularly the entries with the MY-Prefix pattern. I suspect this is due to the logic in the ORDER BY clause not properly handling the sorting of multi-segment alphanumeric strings.

I want to:

Sort rows where the numeric segment is extracted from title and ordered numerically. Handle alphanumeric prefixes correctly so that entries like MY-Prefix-1-Entrance-1 appear before MY-Prefix-10-Entrance-1.


Solution

  • That's one of the purposes of custom ICU collations. You can define a sorting method that treats numbers numerically, the way you wanted: demo at db<>fiddle

    CREATE COLLATION numerical( provider=icu
                               ,deterministic=false
                               ,locale='und-u-kn' );
    

    Then you just tell your order by to use it:

    SELECT title
    FROM door
    ORDER BY title COLLATE numerical;
    

    The demo shows it's also faster than trying to handle that using regex and conditionals.