Search code examples
oracle-databasefilecharacter-encodingclob

Clob with inappropriate charset


I want to load a file into plsql clob variable and then perform some regexp transformations on it. The problem I have is that somehow I get my charset messed up.

So far I tried:

declare 
 l_filename varchar2(100) := 'sample.txt';
 l_clob clob;
 l_bfile bfile;
 begin
 dbms_lob.createtemporary(l_clob, true);
 l_bfile := bfilename( 'SAMPLE_DIR', l_filename );
 dbms_lob.fileopen( l_bfile );
 dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength( l_bfile ));
 dbms_lob.fileclose( l_bfile );
 dbms_output.put_line(l_clob);
 end;
/

I created a flat file 'sample.txt' with 'test file' in it and when I'm eventually printing it I get '瑥獴⁦楬攊'. The charset I use in my db is utf-8. Why did my encoding got messed up?

I found some online encoder https://www.urlencoder.org/. When I put my Chinese letters in it and used UTF16-BE as a destination charset I got 'test%20file%0A'. But I still don't know why my encoding got messed up.


Solution

  • Are you sure your db character set is AL32UTF8 and your file on disk is ASCII? Because it looks like something, somewhere is AL16UTF16 (which I think is the same as UTF-16BE).

    Your test string "test file\n" encodes to UTF8 bytes 0x746573742066696C650A. You actually have 5 Asian characters because the third is a non-printable character, LEFT-TO-RIGHT ISOLATE.

    When I convert "test file\n" to UTF16, I get the following:

    select dump( utl_i18n.raw_to_char( hextoraw('746573742066696C650A'), 'AL16UTF16' ), 1016 ) 
    from dual;
    
    Typ=1 Len=15 CharacterSet=AL32UTF8: e7,91,a5,e7,8d,b4,e2,81,a6,e6,a5,ac,e6,94,8a
    
    select dump( '瑥獴⁦楬攊', 1016) from dual;
    
    Typ=96 Len=15 CharacterSet=AL32UTF8: e7,91,a5,e7,8d,b4,e2,81,a6,e6,a5,ac,e6,94,8a
    

    Note that the byte sequences are the same. The second dump() has Typ=96, which is an NCHAR value. Do you get the same character sets for your database as mine?

    select * from nls_database_parameters 
    where parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET' );
    
    PARAMETER               VALUE
    NLS_CHARACTERSET        AL32UTF8
    NLS_NCHAR_CHARACTERSET  AL16UTF16