rexp_like - can't seem to understand how it works

I got a table with a "token" column that is supposed to only contain letter, this gets enforced so far with a simple regexp_like constraint but recently i noticed a particular problem: the letter 'a' (lowercase A) is not allowed. all other standard ASCII letter work as far as i can tell.

Database is:

Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit
Production PL/SQL Release 12.1.0.2.0 - Production

SELECT CHR(num), num
FROM (
  SELECT LEVEL AS num FROM dual CONNECT BY LEVEL <= 200
) WHERE REGEXP_LIKE(CHR(num), '^[A-Z]+$');

using a matching-pattern 'c' or 'i', just using [a-z] or other variations give the correct result, but just the simple [A-Z] omits the 'a' and only that letter.

Edit: Yes, using [a-zA-Z] would cover all characters, same as using REGEXP_LIKE(CHR(num), '^[A-Z]+$', 'i') ('i' is the match_pattern for case-insensitive), but the question is why is just the lowercase 'a' omitted here. We generate some other strings and constants from those tokens and have similar constraints all over our database so i do not just want to fix this problem but also understand what caused that.

Solution

Using [A-Z] should only include uppercase characters, and exclude all lower-case ones. At least, that's what it should - and does - do in English, with binary sorting:

alter session set nls_language = 'ENGLISH';

select * from nls_session_parameters where parameter = 'NLS_SORT';

PARAMETER                      VALUE                                   
------------------------------ ----------------------------------------
NLS_SORT                       BINARY                                  

SELECT CHR(num), num
FROM (
  SELECT LEVEL AS num FROM dual CONNECT BY LEVEL <= 200
) WHERE REGEXP_LIKE(CHR(num), '^[A-Z]+$');

C        NUM
- ----------
A         65
...
Z         90

26 rows selected.

In German (or any other language that modifies the default sort order) it does what I think you're describing:

alter session set nls_language = 'GERMAN';

select * from nls_session_parameters where parameter = 'NLS_SORT';

PARAMETER                      VALUE                                   
------------------------------ ----------------------------------------
NLS_SORT                       GERMAN                                  

SELECT CHR(num), num
FROM (
  SELECT LEVEL AS num FROM dual CONNECT BY LEVEL <= 200
) WHERE REGEXP_LIKE(CHR(num), '^[A-Z]+$');

C        NUM
- ----------
A         65
...
X         88
Y         89
Z         90
b         98
c         99
d        100
...
z        122

Switching to character classes makes it behave more consistently:

alter session set nls_language = 'GERMAN';

SELECT CHR(num), num
FROM (
  SELECT LEVEL AS num FROM dual CONNECT BY LEVEL <= 200
) WHERE REGEXP_LIKE(CHR(num), '^[[:upper:]]+$');

C        NUM
- ----------
A         65
...
Z         90

26 rows selected.

If you want to include all upper and lowercase characters then you can use [:alpha:] instead of [:upper:]. Note that either of these will include diacritics (and lower or alpha will include 'ß'); which may or may not be desirable:

SELECT character
FROM (
  SELECT CHR(LEVEL + 64) AS character
  FROM dual CONNECT BY LEVEL <= 58
  UNION ALL SELECT column_value
  FROM TABLE(sys.odcivarchar2list('ß', 'ä', 'ö', 'ü', 'Ä', 'Ö', 'Ü'))
)
WHERE REGEXP_LIKE(character, '^[[:upper:]]+$')
ORDER BY character;

CHARACTER
---------
A
Ä
B
...

This is explained in the documentation.

Traditional regular expression engines were designed to address only English text. However, regular expression implementations can encompass a wide variety of languages with characteristics that are very different from western European text. The implementation of regular expressions in Oracle Database is based on the Unicode Regular Expression Guidelines. The REGEXP SQL functions work with all character sets that are supported as database character sets and national character sets. Moreover, Oracle Database enhances the matching capabilities of the POSIX regular expression constructs to handle the unique linguistic requirements of matching multilingual data.

and

According to the POSIX standard, a range in a regular expression includes all collation elements between the start point and the end point of the range in the linguistic definition of the current locale. Therefore, ranges in regular expressions are meant to be linguistic ranges, not byte value ranges ...

... Oracle Database interprets range expressions as specified by the NLS_SORT parameter to determine the collation elements covered by a given range

Linguistic collation is essentially based on how languages behave, or are expected to behave by their users.

If you have a binary sort then characters are ordered based purely on the code point (ASCII/Unicode value) of each character; 'a' is 97, which is obviously after (numerically higher than) 'A' which is 65. If you select all the characters and use a binary sort you see:

alter session set nls_language = 'GERMAN';
alter session set nls_sort = 'BINARY';

SELECT CHR(LEVEL + 64) AS character, NLSSORT(CHR(LEVEL + 64)) AS sort
FROM dual CONNECT BY LEVEL <= 58
UNION ALL SELECT column_value, NLSSORT(column_value)
FROM TABLE(sys.odcivarchar2list('ß', 'ä', 'ö', 'ü', 'Ä', 'Ö', 'Ü'))
ORDER BY character;

CHARACTER SORT     
--------- ---------
A         4100     
B         4200     
C         4300     
...
X         5800     
Y         5900     
Z         5A00     
[         5B00     
\         5C00     
]         5D00     
^         5E00     
_         5F00     
`         6000     
a         6100     
b         6200     
c         6300    
... 
x         7800     
y         7900     
z         7A00     
Ä         C38400   
Ö         C39600   
Ü         C39C00   
ß         C39F00   
ä         C3A400   
ö         C3B600   
ü         C3BC00   

65 rows selected.

The nlssort() function lets you see the actual value Oracle is using for the sorting - here it's the binary value for the characters, so 'A' is still 65 (well, 41 as it's in hex) plus a null byte. When you order by those 'a' is still after 'A'.

With linguistic collation those sort values are completely different:

alter session set nls_sort = 'GERMAN';

SELECT CHR(LEVEL + 64) AS character, NLSSORT(CHR(LEVEL + 64)) AS sort
FROM dual CONNECT BY LEVEL <= 58
UNION ALL SELECT column_value, NLSSORT(column_value)
FROM TABLE(sys.odcivarchar2list('ß', 'ä', 'ö', 'ü', 'Ä', 'Ö', 'Ü'))
ORDER BY character;

CHARACTER SORT     
--------- ---------
[         00005B00 
\         00005C00 
]         00005D00 
^         00005E00 
_         00005F00 
`         00006000 
a         14000100 
A         14000200 
ä         14000900 
Ä         14000A00 
b         19000100 
B         19000200 
c         1E000100 
C         1E000200 
...
r         64000100 
R         64000200 
s         69000100 
S         69000300 
ß         69004400 
t         6E000100 
T         6E000200 
u         73000100 
U         73000200 
ü         73000900 
Ü         73000A00 
v         78000100 
V         78000200 
w         7A000100 
W         7A000200 
x         7D000100 
X         7D000200 
y         82000100 
Y         82000200 
z         87000100 
Z         87000200 

65 rows selected.

The 'linguistic' part means that all variations of 'a' are together - 'a', 'A', 'ä' and 'Ä', in that order. Then all the 'b' variations, etc., with language-specific characters grouped with their generic relations. There is much more about this in the documentation, including examples.

The behaviour you're seeing is explained by that ordering. Look again at the start of that sorted list:

CHARACTER SORT     
--------- ---------
...
a         14000100 
A         14000200 
ä         14000900 
Ä         14000A00 
b         19000100 
B         19000200 
c         1E000100 
C         1E000200

If you try to use a regular expression with a range of [A-Z] then it starts at 'A', but in this collation order that is after 'a' - so that lower-case 'a' is excluded. Similarly, if you use [B-Z] then you start from 'B', which is after all the 'a' variants and lowercase 'b', so that is now excluded.