Search code examples
oracle-databaseutf-8data-conversionlatin1

How to determine if characters in an Oracle DB field are within the UTF8 charset but outside of LATN-1 with SQL?


I currently have the need to write a SQL query to determine the number of rows in my UTF8 Oracle database that are not compatible with another system that uses LATIN-1.

For example, Über should not return a result, but 翻译 should

I have tried queries such as:

select decode(convert(convert('Über test', 'WE8ISO8859P1'), 'UTF8'), convert('Über test', 'UTF8'), 1, 0) from dual;

However, this does not give me the result that I need. Can anyone provide a SQL-only solution to this problem? Thanks


Solution

  • I think I have figured it out:

    select * from (select asciistr(convert('test string goes here', 'UTF8')) as str from dual) where regexp_like(str, '.*\\([1-9A-F]|0[1-9A-F]).*');

    Using http://en.wikipedia.org/wiki/Latin-1_Supplement_%28Unicode_block%29 as a reference, the LATIN-1 block of unicode ends at \00FF.

    For example,

    SQL> select * from (select asciistr(convert('翻译', 'UTF8')) as str from dual) where regexp_like(str, '.*\\([1-9A-F]|0[1-9A-F]).*');
    
    STR
    ------------------------------
    \7FFB\8BD1
    

    If someone could double-check this from a logical standpoint, I would appreciate it.