
Parser is not encoding string correctly


Text I'm trying to get:

przełącznica

This is what I actually have (the browser might not display it properly; there are two squares instead of "łą"):

przecznica

BLOB:

70 72 7A 65 C5 82 C4 85 63 7A 6E 69 63 61

EDIT: This is what I get from parser

70 72 7A 65 1A 1A 63 7A 6E 69 63 61

ESQL used to parse BLOB:

DECLARE blobMsg BLOB InputRoot.BLOB.BLOB;
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN ('XMLNSC') NAME 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC PARSE(blobMsg OPTIONS FolderBitStream CCSID 1208 FORMAT 'XMLNSC');

I have tried CCSIDs: 1208 (UTF8), 912 (ISO-8859-2), 1200(UTF16 I guess): https://www.ibm.com/support/knowledgecenter/ssw_ibm_i_71/nls/rbagsccsidcdepgscharsets.htm

EDIT: Working code:

DECLARE blobMsg BLOB InputRoot.BLOB.BLOB;
DECLARE remove BLOB X'EFBBBF';
DECLARE message BLOB REPLACE(InputRoot.BLOB.BLOB, remove, CAST('' AS BLOB));
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN ('XMLNSC') NAME 'XMLNSC';
CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC PARSE(message OPTIONS FolderBitStream CCSID 05348 FORMAT 'XMLNSC');
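
For reference, X'EFBBBF' is the UTF-8 byte order mark (BOM). The same idea can be sketched outside the broker in Python. This is only an illustration, not the ESQL the flow runs: the `payload` value is a made-up example (BOM plus the BLOB bytes shown above), and `strip_utf8_bom` is a hypothetical helper that removes only a leading BOM, whereas the ESQL REPLACE removes every occurrence.

```python
import codecs

UTF8_BOM = codecs.BOM_UTF8  # b'\xef\xbb\xbf', the same bytes as X'EFBBBF'

# Hypothetical payload: a BOM followed by the BLOB bytes from the question.
payload = UTF8_BOM + bytes.fromhex("70727A65C582C485637A6E696361")

def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 BOM if present; return other data unchanged."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

# Decode the remaining bytes as UTF-8.
text = strip_utf8_bom(payload).decode("utf-8")
```

Python's "utf-8-sig" codec does the same stripping while decoding, so `payload.decode("utf-8-sig")` is an alternative.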

Solution

  • Firstly, przełącznica by itself is not valid XML, so you'll get an exception when you try to invoke the XMLNSC parser using the code you have outlined. You need to CAST the BLOB to a character string instead.

    I generated a little test Application/MsgFlow in IIB 10 to illustrate CASTing the BLOB.

    Simple Encoding App

    The code in ConvertAndParse is

    CREATE COMPUTE MODULE ConvertAndParse
    CREATE FUNCTION Main() RETURNS BOOLEAN
    BEGIN
        DECLARE blobMsg BLOB X'70727A65C582C485637A6E696361';
        CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg DOMAIN 'XMLNSC';
        CREATE LASTCHILD OF OutputLocalEnvironment.Variables.inpMsg.XMLNSC NAME 'AsUtf8' VALUE CAST(blobMsg AS CHAR CCSID 1208);
    
        CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC';
        CREATE LASTCHILD OF OutputRoot.XMLNSC.EncodingResponse NAME 'AsUtf8InTag' VALUE CAST(blobMsg AS CHAR CCSID 1208);
        CREATE LASTCHILD OF OutputRoot.XMLNSC.EncodingResponse NAME CAST(blobMsg AS CHAR CCSID 1208) VALUE 'As a tag name';
    
        RETURN TRUE;
    END;
    END MODULE;
    

    When I run a debug session, the value put into the LocalEnvironment tree looks like this:

    Debug Values

    And this is the result of invoking the flow from a browser:

    Browser Result

    Now let's deal with which encoding we are looking at. Taking what I assume is the input BLOB, let's see if it matches up with UTF-8.

    70 72 7A 65 C5 82 C4 85 63 7A 6E 69 63 61
    

    UTF-8 is a variable-width character encoding: ASCII characters take a single byte with the high-order bit clear, and every byte of a multi-byte sequence (two to four bytes) has the high-order bit set. We also want a page that shows the common code points for UTF-8, such as the "Complete Character List for UTF-8". Note that it's not actually complete.
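
The high-order-bit rule can be made concrete with a small Python sketch. The function name `utf8_sequence_length` is a hypothetical helper invented for this illustration; it only classifies a single byte by its leading bits and does not validate whole sequences.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the length of the UTF-8 sequence that starts with `lead`.

    Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte.
    """
    if lead < 0x80:            # 0xxxxxxx: single byte (ASCII)
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: two bytes (C0/C1 are invalid overlongs)
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: three bytes
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: four bytes (F5..FF are invalid)
        return 4
    return 0                   # continuation byte or invalid lead byte

# The BLOB from the question.
blob = bytes.fromhex("70727A65C582C485637A6E696361")
lengths = [utf8_sequence_length(b) for b in blob]
```

For this BLOB, only C5 and C4 are classified as lead bytes of two-byte sequences, and 82 and 85 as continuation bytes; everything else is plain ASCII.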

    Looking at the first 4 bytes, none of them have the high-order bit on:

    70 72 7A 65 
    

    And the aforementioned Character List says that's prze, so far so good.

    Then we hit C5, which has the high-order bit on. Doing a bit of visual parsing, we get two probable two-byte character pairs:

    C5 82
    C4 85
    

    Referring to the Character List our two candidate pairs do in fact match the two characters we want and the next six characters which do not have their high order bits on translate to cznica. Looking really good.
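
The same check can be done mechanically instead of by eye. A minimal Python sketch, assuming only the BLOB bytes quoted above:

```python
# The BLOB from the question.
blob = bytes.fromhex("70727A65C582C485637A6E696361")

# Strict decoding: raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = blob.decode("utf-8")

# Encode the two non-ASCII characters on their own to see their byte pairs.
pairs = {ch: ch.encode("utf-8").hex(" ").upper() for ch in "łą"}
```

Here `text` comes out as przełącznica, and `pairs` maps ł to C5 82 and ą to C4 85, matching the two candidate pairs.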

    Now to eliminate the other candidate encodings, if we can.

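    UTF-16 uses 2 or 4 bytes to represent each character (4 only for code points outside the Basic Multilingual Plane), and the byte order is either indicated by a Byte Order Mark or fixed by the variant. Either way, prze would be encoded as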

    UTF-16BE - CP 1200 - 00 70 00 72 00 7A 00 65
    UTF-16LE - CP 1202 - 70 00 72 00 7A 00 65 00
    

    Given that there are not lots and lots of null characters 00 it is reasonable to discount UTF-16.
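
The UTF-16 byte patterns above, and the absence of null bytes in the BLOB, can be confirmed with a short Python sketch (again assuming only the bytes quoted in the question):

```python
# How "prze" looks in the two UTF-16 byte orders.
be = "prze".encode("utf-16-be").hex(" ").upper()
le = "prze".encode("utf-16-le").hex(" ").upper()

# The BLOB from the question contains no 00 bytes at all.
blob = bytes.fromhex("70727A65C582C485637A6E696361")
has_nulls = 0x00 in blob
```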

    ISO-8859-2 - CP 912 is a single-byte character set, so each desired character would occupy one byte (ł is B3 and ą is B1), not two. In that code page C5 and C4 are Ĺ and Ä, which do not match the two desired characters, and thus we can eliminate it.
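
A minimal Python sketch of the ISO-8859-2 elimination, using Python's "iso8859_2" codec as a stand-in for CP 912 (an assumption: the two are treated here as equivalent for these code points):

```python
# The BLOB from the question.
blob = bytes.fromhex("70727A65C582C485637A6E696361")

# Interpreting every byte as one ISO-8859-2 character does not fail,
# but C5 and C4 become the wrong letters and 82/85 become control characters.
as_latin2 = blob.decode("iso8859_2")

# What the two desired characters would look like if the data really were ISO-8859-2.
expected_latin2 = "łą".encode("iso8859_2").hex(" ").upper()
```

The decoded string contains Ĺ and Ä, each followed by an unprintable control character, where łą should be, while genuine ISO-8859-2 data would carry the single bytes B3 B1 at that position.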