oracle-database file character-encoding clob

Clob with inappropriate charset

I want to load a file into plsql clob variable and then perform some regexp transformations on it. The problem I have is that somehow I get my charset messed up.

So far I tried:

declare 
 l_filename varchar2(100) := 'sample.txt';
 l_clob clob;
 l_bfile bfile;
 begin
 dbms_lob.createtemporary(l_clob, true);
 l_bfile := bfilename( 'SAMPLE_DIR', l_filename );
 dbms_lob.fileopen( l_bfile );
 dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength( l_bfile ));
 dbms_lob.fileclose( l_bfile );
 dbms_output.put_line(l_clob);
 end;
/

I created a flat file 'sample.txt' with 'test file' in it and when I'm eventually printing it I get '瑥獴⁦楬攊'. The charset I use in my db is utf-8. Why did my encoding got messed up?

I found some online encoder https://www.urlencoder.org/. When I put my Chinese letters in it and used UTF16-BE as a destination charset I got 'test%20file%0A'. But I still don't know why my encoding got messed up.

Solution

Are you sure your db character set is AL32UTF8 and your file on disk is ASCII? Because it looks like something, somewhere is AL16UTF16 (which I think is the same as UTF-16BE).

Your test string "test file\n" encodes to UTF8 bytes 0x746573742066696C650A. You actually have 5 Asian characters because the third is a non-printable character, LEFT-TO-RIGHT ISOLATE.

When I convert "test file\n" to UTF16, I get the following:

select dump( utl_i18n.raw_to_char( hextoraw('746573742066696C650A'), 'AL16UTF16' ), 1016 ) 
from dual;

Typ=1 Len=15 CharacterSet=AL32UTF8: e7,91,a5,e7,8d,b4,e2,81,a6,e6,a5,ac,e6,94,8a

select dump( '瑥獴⁦楬攊', 1016) from dual;

Typ=96 Len=15 CharacterSet=AL32UTF8: e7,91,a5,e7,8d,b4,e2,81,a6,e6,a5,ac,e6,94,8a

Note that the byte sequences are the same. The second dump() has Typ=96, which is an NCHAR value. Do you get the same character sets for your database as mine?

select * from nls_database_parameters 
where parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET' );

PARAMETER               VALUE
NLS_CHARACTERSET        AL32UTF8
NLS_NCHAR_CHARACTERSET  AL16UTF16