Search code examples
sqloracle-database

How to replace unicode displaying in XML from SQL results


I have the following SQL query that (appears to have some kind of whitespace at the end)is returning data in my XML output and I am trying to remove this upfront in the SQL from being shown in the XML output:

SELECT 'H' KEY , 
REPLACE(REPLACE(LN.VENDOR_PRODUCT_NUM, CHR(8237), ''), CHR(8236), '') vendor_catalog_number

FROM MyTable

SQL results (doesn't appear to show the whitespace at the end):

KEY        vendor_catalog_number
H       ‭   ‭CUR251636‬

enter image description here

XML output:

<?xml version = '1.0' encoding = 'utf-8'?>
<DATA_DS>
<G_1>
<KEY>H</KEY>
<VENDOR_CATALOG_NUMBER><0x202d>‭CUR251636‬<0x202c></VENDOR_CATALOG_NUMBER>
</G_1>
</DATA_DS>

As a result the data is being rendered in my Oracle eText report template as ?CUR251636? (with the question marks around the string.)

I also tried using just for Linefeeds and Carriage returns as below but that still did not work:

REPLACE(REPLACE(LN.VENDOR_PRODUCT_NUM, CHR(13)), CHR(10)) vendor_catalog_number

What other Oracle SQL function can I use to remove this whitespace/unicode characters?

EDIT:

SELECT DUMP(LN.VENDOR_PRODUCT_NUM, 1016) FROM MyTable

Results: Typ=1 Len=15 CharacterSet=AL32UTF8: e2,80,ad,43,55,52,32,35,31,36,33,36,e2,80,ac

SELECT DUMP(LN.VENDOR_PRODUCT_NUM, 1016) FROM MyTable

Results: E280AD435552323531363336E280AC


Solution

  • Those are POP DIRECTIONAL FORMATTING and LEFT-TO-RIGHT OVERRIDE. No idea why they are there; perhaps from some conversion between writing systems on insert?

    Anyway... it seems you need the AL16UTF16 version of the codepoints, not the AL32UTF8 versions (not sure why, hopefully someone else can explain it...). Your code with CHR() doesn't replace anything:

    with cte as (
      SELECT 'H' KEY,
      REPLACE(REPLACE(LN.VENDOR_PRODUCT_NUM, CHR(8237), ''), CHR(8236), '') vendor_catalog_number
      FROM MyTable LN
    )
    select KEY, vendor_catalog_number,
      DUMP(vendor_catalog_number, 1016) as dmp,
      RAWTOHEX(UTL_RAW.CAST_TO_RAW(vendor_catalog_number)) as hex
    from cte
    
    KEY VENDOR_CATALOG_NUMBER DMP HEX
    H ‭CUR251636‬ Typ=1 Len=15 CharacterSet=AL32UTF8: e2,80,ad,43,55,52,32,35,31,36,33,36,e2,80,ac E280AD435552323531363336E280AC

    If you use the UNISTR() function (or add the USING NCHAR_CS option to CHR() then it does replace those:

    with cte as (
      SELECT 'H' KEY,
      REPLACE(REPLACE(LN.VENDOR_PRODUCT_NUM, UNISTR('\202d'), null), UNISTR('\202c'), null) vendor_catalog_number
      FROM MyTable LN
    )
    select KEY, vendor_catalog_number,
      DUMP(vendor_catalog_number, 1016) as dmp,
      RAWTOHEX(UTL_RAW.CAST_TO_RAW(vendor_catalog_number)) as hex
    from cte
    
    KEY VENDOR_CATALOG_NUMBER DMP HEX
    H CUR251636 Typ=1 Len=9 CharacterSet=AL32UTF8: 43,55,52,32,35,31,36,33,36 435552323531363336

    ... producing just your desired 9-character string 'CUR251636'.

    fiddle

    So instead of your current query:

    SELECT 'H' KEY , 
    REPLACE(REPLACE(LN.VENDOR_PRODUCT_NUM, CHR(8237), ''), CHR(8236), '') vendor_catalog_number
    FROM MyTable LN
    

    you can do:

    SELECT 'H' KEY,
    REPLACE(REPLACE(LN.VENDOR_PRODUCT_NUM, UNISTR('\202d'), null), UNISTR('\202c'), null) vendor_catalog_number
    FROM MyTable LN