I have an Oracle table with a description
column whose type is CLOB. The column can contain text in whatever language. I'm interested in retrieving all the rows that contains at least one chinese character in that description
column.
How can I do?
Some resources that I looked at:
You can use a regular expression function like regexp_instr()
, and use ranges of characters you want to search for.
If you have data like:
ID | DESCRIPTION |
---|---|
1 | Just ASCII |
2 | With Chinese 好世界 text |
3 | With Russian Привет, мир text |
then using the ranges from this answer you can do:
select *
from your_table
where regexp_instr(description, unistr('[\2E80-\2FD5\3190-\319f\3400-\4DBF\4E00-\9FCC\F900-\FAAD]')) > 0
ID | DESCRIPTION |
---|---|
2 | With Chinese 好世界 text |
As discussed in the various answers on the linked question, these are CJK ranges so not strictly just Chinese, and you may need to include additional extension ranges, but... might be good enough for what you're doing, or at least a starting point you can work from.
Also notice that I've used the unistr()
function to convert the ranges to something that can be recognised as a regex pattern, and that doesn't need the u
prefixes in the linked answer - so it takes for example \2E80-\2FD5
rather than u\2E80-u\2FD5
.
You could also possibly use Oracle Text's CHINESE_LEXER
or MULTI_LEXER
to build indexes that let you search for Chinese text - but that isn't something I've ever looked at, so you'd need to experiment or find examples or other answers that can explore it. This documentation may or may not help...