Search code examples
oracleutf-8character-encodingwindows-1252

Get non-ASCII character from single character code


I have a couple databases where the NLS character encoding differs. One uses AL32UTF8 (UTF-8), and the other uses WE8MSWIN1252 (Windows 1252). Yes, this is bad, and I will push for getting things straightened out so they're the same. In the meantime, I need to write some SQL containing a non-ASCII character. For example, I need to use á, which is U+00E1 in Unicode. In UTF-8, it's represented with the hexadecimal value 0xC3A1, and in Windows 1252 it's represented with the hexadecimal value 0x00E1. (In other words, those are the byte values.)

So this gives me what I want on the UTF-8 database:

SELECT CHR(TO_NUMBER('C3A1', 'xxxx')) FROM DUAL;

And this gives me what I want on the Windows 1252 database:

SELECT CHR(TO_NUMBER('00E1', 'xxxx')) FROM DUAL;

I can't seem to figure out a way to get Oracle to just take one and convert it to the right encoding, though. I don't really care which representation I end up using. I just want to only use one hexadecimal value since the same script needs to run on both databases. How can I do that?

Using Oracle 11g. (Unfortunately, one is 11.1 and the other is 11.2 for the moment.)


Solution

  • Two functions can you help UNISTR and TO_CHAR:

    1. To return in the national character set of the database:

       SELECT UNISTR('\00E1') FROM DUAL;
      
    2. To return in the database character set:

       SELECT TO_CHAR(UNISTR('\00E1')) FROM DUAL;
      

    From UNISTR documentation:

    UNISTR takes as its argument a text literal or an expression that resolves to character data and returns it in the national character set. The national character set of the database can be either AL16UTF16 or UTF8. UNISTR provides support for Unicode string literals by letting you specify the Unicode encoding value of characters in the string. This is useful, for example, for inserting data into NCHAR columns.

    The Unicode encoding value has the form '\xxxx' where 'xxxx' is the hexadecimal value of a character in UCS-2 encoding format. Supplementary characters are encoded as two code units, the first from the high-surrogates range (U+D800 to U+DBFF), and the second from the low-surrogates range (U+DC00 to U+DFFF). To include the backslash in the string itself, precede it with another backslash (\).

    For portability and data preservation, Oracle recommends that in the UNISTR string argument you specify only ASCII characters and the Unicode encoding values.

    From TO_CHAR(character) documenation:

    TO_CHAR(character) converts NCHAR, NVARCHAR2, CLOB, or NCLOB data to the database character set. The value returned is always VARCHAR2.