For last couple of days I've been trying to sort a list of alphanumeric text in a natural order.I found that using the NLS_SORT option can order the list correctly (see this answer). But when trying out that solution I found that it made no difference. The list was still displayed as with a normal ORDER BY query. Please not that a solution involving regex is not an option for me.
For testing purposes I made a table and filled it with some data. When running SELECT name FROM test ORDER BY name ASC
I get the following result:
As you see the the ordering is unnatural. It should be more like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
.
The solutions I tried involved setting the nls_sort option.
ALTER SESSION SET nls_sort='BINARY'; -- or BINARY_AI
SELECT name FROM test ORDER BY NLSSORT(name,'NLS_SORT=BINARY') -- or BINARY_AI
It should order the text in the list based on the decimal code of each character as stated in the ASCII table. So I expected it to turn out the right way (as the order in that table is 'space', 'dot', numbers, letters), but it did not change anything. The order is still the same as in the image.
If it is BINARY then the sort order is based on the numeric value of each character, so it's dependant on the database character set
It might have something to do with the character set I'm using, but I'm not sure what wrong with it. Running SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET';
gives me the value AL32UTF8
. Which seems like a slightly extended version of UTF8 (correct me if I'm wrong). I'm running on Oracle database version 11.2.0.4.0.
So can anyone tell me what I'm doing wrong or what I'm missing?
Thanks in advance.
You seem to expect a binary sort to look at multiple characters at once. It doesn't. It effectively sorts by the first character (so everything starting with 1 comes before anything starting with 2); then by the second character (so a period comes before a 0) - which means it's correct that 1.
comes before 10
, but also that 10
(or 100000) comes before 2
. You can't change that aspect of the sorting behaviour. In the earlier question you linked to, it looks like only the first character was numeric, which is a slightly different situation.
When character values are compared linguistically for the
ORDER BY
clause, they are first transformed to collation keys and then compared likeRAW
values. The collation keys are generated either explicitly as specified inNLSSORT
or implicitly using the same method thatNLSSORT
uses.
You can see the byte order used for sorting:
with t (name) as (
select level - 1 || '. test' from dual connect by level < 13
union all select '20. test' from dual
union all select '100. test' from dual
)
select name, nlssort(name, 'NLS_SORT=BINARY') as sort_bytes
from t
order by name;
NAME SORT_BYTES
---------- --------------------
0. test 302E207465737400
1. test 312E207465737400
10. test 31302E207465737400
100. test 3130302E207465737400
11. test 31312E207465737400
2. test 322E207465737400
20. test 32302E207465737400
3. test 332E207465737400
4. test 342E207465737400
5. test 352E207465737400
6. test 362E207465737400
7. test 372E207465737400
8. test 382E207465737400
9. test 392E207465737400
You can see that the raw NLSRORT
results (collation keys) are in a logical order.
If you don't want to use a regular expression you can use substr()
and instr()
to get the part before the period/space and convert that to a number; though that assumes the format is fixed:
with t (name) as (
select level - 1 || '. test' from dual connect by level < 13
union all select '20. test' from dual
union all select '100. test' from dual
)
select name
from t
order by to_number(substr(name, 1, instr(name, '. ') - 1)),
substr(name, instr(name, '. '));
NAME
----------
0. test
1. test
2. test
3. test
4. test
5. test
6. test
7. test
8. test
9. test
10. test
11. test
20. test
100. test
If there might not be a period/space you could check for that:
select name
from t
order by case when instr(name, '. ') > 0 then to_number(substr(name, 1, instr(name, '. ') - 1)) else 0 end,
case when instr(name, '. ') > 0 then substr(name, instr(name, '. ')) else name end;
... but you still have a problem if you had, say, two sentences in the name but the first can't be converted to a number. You could implement a 'safe' to_number()
function that squashes an ORA-01722 if that happens.
It would be simpler and safer to use a regular expression, e.g.:
select name
from t
order by to_number(regexp_substr(name, '^\d+', 1)), name;