I'm using the following regex in Oracle DB under CLOB datatype column:
SELECT
CASE
WHEN COUNT(*) > 0 THEN 1
ELSE 0
END AS contains_extended_unicode
FROM my_table
WHERE REGEXP_LIKE(TO_CHAR(my_column), '[^[:print:]\r\n\t]');
The query should return as 0 when data is purely with printable ASCII characters from keyboard. If any characters beyond, i.e •, the bullet list from MS Word will be treated as extended Unicode character, and query return as 1.
Currently, if the data contains ASCII characters with more than 1 line, Oracle still treat as extended Unicode character, thus query return as 1.
Sample data:
qwerty
qwerty
Oracle does not support PERL-like shorthand character classes in regular expression character classes.
In a regular expression character class, Oracle does not evaluate \n\r\t
as newline, carriage return and tab but as those literal characters. So [^[:print:]\r\n\t]
will match all characters that are not either visible characters (the POSIX character class [:print:]
) or \
or r
or \
or n
or \
or t
.
What you want to do is use the CHR
function to generate the characters and concatenate them into your regular expression:
SELECT CASE
WHEN COUNT(*) > 0 THEN 1
ELSE 0
END AS contains_extended_unicode
FROM my_table
WHERE REGEXP_LIKE(
my_column,
'[^[:print:]' || CHR(10) || CHR(13) || CHR(9) || ']'
);
If you want to match any character that is not an ASCII printable character (ASCII Codes 9, 10, 13 and 32-127) then you can use:
SELECT CASE
WHEN COUNT(*) > 0 THEN 1
ELSE 0
END AS contains_extended_unicode
FROM my_table
WHERE REGEXP_LIKE(
my_column,
'[^' || CHR(32) || '-' || CHR(127) || CHR(10) || CHR(13) || CHR(9) || ']'
);