Search code examples
oracle-databaseutf-8latin1

Oracle: Convert xml entities in a varchar2 field to utf-8 characters


I have a field in a table which holds XML entities for special characters, since the table is in latin-1. E.g. "Hallöle slovenčina" (the "ö" is in latin-1, but the "č" in "slovenčina" had to be converted to an entity by some application that stores the values into the database)

Now I need to export the table into a utf-8 encoded file by converting the XML entities to their original characters.

Is there a function in Oracle that might handle this for me, or do I really need to create a huge key/value map for that?

Any help is greatly appreciated.

EDIT: I found the function DBMS_XMLGEN.convert, but it only works on <,> and &. Not on &#NNN; :-(


Solution

  • I believe the problem with dbms_xmlgen is that there are technically only five XML entities. Your example has a numeric HTML entity, which corresponds with Unicode:

    http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=0100&end=017F

    Oracle has a function UNISTR which is helpful here:

    select unistr('sloven\010dina') from dual;
    

    I've converted 269 to its hex equivalent 010d in the example above (in Unicode it is U+010D). However, you could pass a decimal number and do a conversion like this:

    select unistr('sloven\' || replace(to_char(269, 'xxx'), ' ', '0') || 'ina') from dual;
    

    EDIT: The PL/SQL solution:

    Here's an example I've rigged up for you. This should loop over and replace any occurrences for each row you select out of your table(s).

    create table html_entities (
        id NUMBER(3),
        text_row VARCHAR2(100)
    );
    
    INSERT INTO html_entities 
    VALUES (1, 'Hallöle sloven&#269;ina &#266; &#250;');
    
    INSERT INTO html_entities 
    VALUES (2, 'I like the letter &#266;');
    
    INSERT INTO html_entities 
    VALUES (3, 'Nothing to change here.');
    
    DECLARE
        v_replace_str NVARCHAR2(1000);
        v_fh UTL_FILE.FILE_TYPE;       
    BEGIN
        --v_fh := utl_file.fopen_nchar(LOCATION IN VARCHAR2, FILENAME IN VARCHAR2, OPEN_MODE IN VARCHAR2, MAX_LINESIZE IN BINARY_INTEGER);
    
        FOR v_rec IN (select id, text_row from html_entities) LOOP
            v_replace_str := v_rec.text_row;
            WHILE (REGEXP_INSTR(v_replace_str, '&#[0-9]+;') <> 0) LOOP
                v_replace_str := REGEXP_REPLACE(
                    v_replace_str, 
                    '&#([0-9]+);',
                    unistr('\' || replace(to_char(to_number(regexp_replace(v_replace_str, '.*?&#([0-9]+);.*$', '\1')), 'xxx'), ' ', '0')),
                    1,
                    1
                );
            END LOOP;
    
            -- utl_file.put_line_nchar(v_fh, v_replace_str);
            dbms_output.put_line(v_replace_str);
    
        END LOOP;
        --utl_file.fclose(v_fh);
    END;
    /
    

    Notice that I've stubbed in calls to the UTL_FILE function to write NVARCHAR lines (Oracle's extended character set) to a file on the database server. The dbms_output, while great for debugging, doesn't seem to support extended characters, but this shouldn't be a problem if you use UTL_FILE to write to a file. Here's the DBMS_OUTPUT:

    Hallöle slovencina C ú
    I like the letter C
    Nothing to change here.