Search code examples
oracle-databasenatural-sortvarchar2

Binary sort on alphanumeric text not behaving as natural sort


For last couple of days I've been trying to sort a list of alphanumeric text in a natural order.I found that using the NLS_SORT option can order the list correctly (see this answer). But when trying out that solution I found that it made no difference. The list was still displayed as with a normal ORDER BY query. Please not that a solution involving regex is not an option for me.

For testing purposes I made a table and filled it with some data. When running SELECT name FROM test ORDER BY name ASC I get the following result:

enter image description here

As you see the the ordering is unnatural. It should be more like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

The solutions I tried involved setting the nls_sort option.

ALTER SESSION SET nls_sort='BINARY'; -- or BINARY_AI
SELECT name FROM test ORDER BY NLSSORT(name,'NLS_SORT=BINARY') -- or BINARY_AI

It should order the text in the list based on the decimal code of each character as stated in the ASCII table. So I expected it to turn out the right way (as the order in that table is 'space', 'dot', numbers, letters), but it did not change anything. The order is still the same as in the image.

If it is BINARY then the sort order is based on the numeric value of each character, so it's dependant on the database character set

It might have something to do with the character set I'm using, but I'm not sure what wrong with it. Running SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET'; gives me the value AL32UTF8. Which seems like a slightly extended version of UTF8 (correct me if I'm wrong). I'm running on Oracle database version 11.2.0.4.0.

So can anyone tell me what I'm doing wrong or what I'm missing?

Thanks in advance.


Solution

  • You seem to expect a binary sort to look at multiple characters at once. It doesn't. It effectively sorts by the first character (so everything starting with 1 comes before anything starting with 2); then by the second character (so a period comes before a 0) - which means it's correct that 1. comes before 10, but also that 10 (or 100000) comes before 2. You can't change that aspect of the sorting behaviour. In the earlier question you linked to, it looks like only the first character was numeric, which is a slightly different situation.

    From the documentation:

    When character values are compared linguistically for the ORDER BY clause, they are first transformed to collation keys and then compared like RAW values. The collation keys are generated either explicitly as specified in NLSSORT or implicitly using the same method that NLSSORT uses.

    You can see the byte order used for sorting:

    with t (name) as (
      select level - 1 || '. test' from dual connect by level < 13
      union all select '20. test' from dual
      union all select '100. test' from dual
    )
    select name, nlssort(name, 'NLS_SORT=BINARY') as sort_bytes
    from t
    order by name;
    
    NAME       SORT_BYTES         
    ---------- --------------------
    0. test    302E207465737400    
    1. test    312E207465737400    
    10. test   31302E207465737400  
    100. test  3130302E207465737400
    11. test   31312E207465737400  
    2. test    322E207465737400    
    20. test   32302E207465737400  
    3. test    332E207465737400    
    4. test    342E207465737400    
    5. test    352E207465737400    
    6. test    362E207465737400    
    7. test    372E207465737400    
    8. test    382E207465737400    
    9. test    392E207465737400  
    

    You can see that the raw NLSRORT results (collation keys) are in a logical order.

    If you don't want to use a regular expression you can use substr() and instr() to get the part before the period/space and convert that to a number; though that assumes the format is fixed:

    with t (name) as (
      select level - 1 || '. test' from dual connect by level < 13
      union all select '20. test' from dual
      union all select '100. test' from dual
    )
    select name
    from t
    order by to_number(substr(name, 1, instr(name, '. ') - 1)),
      substr(name, instr(name, '. '));
    
    NAME     
    ----------
    0. test   
    1. test   
    2. test   
    3. test   
    4. test   
    5. test   
    6. test   
    7. test   
    8. test   
    9. test   
    10. test  
    11. test  
    20. test  
    100. test 
    

    If there might not be a period/space you could check for that:

    select name
    from t
    order by case when instr(name, '. ') > 0 then to_number(substr(name, 1, instr(name, '. ') - 1)) else 0 end,
      case when instr(name, '. ') > 0 then substr(name, instr(name, '. ')) else name end;
    

    ... but you still have a problem if you had, say, two sentences in the name but the first can't be converted to a number. You could implement a 'safe' to_number() function that squashes an ORA-01722 if that happens.

    It would be simpler and safer to use a regular expression, e.g.:

    select name
    from t
    order by to_number(regexp_substr(name, '^\d+', 1)), name;