I want to load a file into plsql clob variable and then perform some regexp transformations on it. The problem I have is that somehow I get my charset messed up.
So far I tried:
declare
l_filename varchar2(100) := 'sample.txt';
l_clob clob;
l_bfile bfile;
begin
dbms_lob.createtemporary(l_clob, true);
l_bfile := bfilename( 'SAMPLE_DIR', l_filename );
dbms_lob.fileopen( l_bfile );
dbms_lob.loadfromfile( l_clob, l_bfile, dbms_lob.getlength( l_bfile ));
dbms_lob.fileclose( l_bfile );
dbms_output.put_line(l_clob);
end;
/
I created a flat file 'sample.txt' with 'test file' in it and when I'm eventually printing it I get '瑥獴楬攊'. The charset I use in my db is utf-8. Why did my encoding got messed up?
I found some online encoder https://www.urlencoder.org/. When I put my Chinese letters in it and used UTF16-BE as a destination charset I got 'test%20file%0A'. But I still don't know why my encoding got messed up.
Are you sure your db character set is AL32UTF8 and your file on disk is ASCII? Because it looks like something, somewhere is AL16UTF16 (which I think is the same as UTF-16BE).
Your test string "test file\n" encodes to UTF8 bytes 0x746573742066696C650A. You actually have 5 Asian characters because the third is a non-printable character, LEFT-TO-RIGHT ISOLATE.
When I convert "test file\n" to UTF16, I get the following:
select dump( utl_i18n.raw_to_char( hextoraw('746573742066696C650A'), 'AL16UTF16' ), 1016 )
from dual;
Typ=1 Len=15 CharacterSet=AL32UTF8: e7,91,a5,e7,8d,b4,e2,81,a6,e6,a5,ac,e6,94,8a
select dump( '瑥獴楬攊', 1016) from dual;
Typ=96 Len=15 CharacterSet=AL32UTF8: e7,91,a5,e7,8d,b4,e2,81,a6,e6,a5,ac,e6,94,8a
Note that the byte sequences are the same. The second dump()
has Typ=96
, which is an NCHAR
value. Do you get the same character sets for your database as mine?
select * from nls_database_parameters
where parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET' );
PARAMETER VALUE
NLS_CHARACTERSET AL32UTF8
NLS_NCHAR_CHARACTERSET AL16UTF16