OmniXML on iOS: Invalid Unicode


I recently switched to the OmniXML bundled with Delphi XE7 so that I can target iOS. The XML data comes from a cloud service and includes nodes with base64-encoded binary data.

Now I get the exception "Invalid Unicode Character value for this platform" when calling XMLDocument.LoadFromStream, and it seems to be the &#xD; character reference at the base64 line breaks that fails.

The nodes with base64 data look similar to this:

<data>TVRMUQAAAAIAAAAAFFo3FAAUAAEA8AADsAAAAEAAAABAAHAAwABgAAAAAAAAAAAQEBAAAAAAAA&#xD;
AAMQAAABNUgAAP/f/AAMABAoAAAAEAAAAAEVNVExNAAAAAQAAAAAUWjcUABQAAQD/wAA&#xD;
AAA=</data>

I traced it down to these lines in XML.Internal.OmniXML:

  psCharHexRef:
    if CharIs_WhiteSpace(ReadChar) then
      raise EXMLException.CreateParseError(INVALID_CHARACTER_ERR, MSG_E_UNEXPECTED_WHITESPACE, [])
    else
    begin
      case ReadChar of
        '0'..'9': CharRef := LongWord(CharRef shl 4) + LongWord(Ord(ReadChar) - 48);
        'A'..'F': CharRef := LongWord(CharRef shl 4) + LongWord(Ord(ReadChar) - 65 + 10);
        'a'..'f': CharRef := LongWord(CharRef shl 4) + LongWord(Ord(ReadChar) - 97 + 10);
        ';':
          if CharIs_Char(Char(CharRef)) then
          begin
            Result := Char(CharRef);
            Exit;
          end
          else
            raise EXMLException.CreateParseError(INVALID_CHARACTER_ERR, MSG_E_INVALID_UNICODE, []);

The exception in the last line is raised because CharIs_Char(#13) returns False, where #13 is the value of CharRef read from &#xD;.

How do I solve this?


Solution

  • This is clearly a bug in OmniXML. It looks like the developers were trying to implement the XML 1.0 specification, which states:

    ...XML processors MUST accept any character in the range specified for Char.

    Character Range

    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

    /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

    The implementation of CharIs_Char, however, looks like this:

    function CharIs_Char(const ch: Char): Boolean;
    begin
      // [2] Char - any Unicode character, excluding the surrogate blocks, FFFE, and FFFF
      Result := not Ch.IsControl;
    end;
    

    This excludes all control characters, including #x9 (TAB), #xA (LF) and #xD (CR), all of which the Char production explicitly allows. In fact, because an XML processor normalizes literal line breaks during parsing (CR LF and a lone CR both become LF, section 2.11 of the specification), the only way to deliver an actual carriage return is through a character reference such as &#xD;.
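
    For comparison, a check that follows production [2] directly might look like the sketch below. It is only an illustration, not the OmniXML source or an official fix: the name CharIs_XmlChar is hypothetical, and because a Delphi Char is a single 16-bit UTF-16 code unit, the supplementary range #x10000-#x10FFFF is not covered and would need surrogate-pair handling in the caller.

    ```pascal
    program XmlCharSketch;

    {$APPTYPE CONSOLE}

    // Hypothetical replacement for CharIs_Char, following production [2] Char.
    // Only the Basic Multilingual Plane is handled, since Char is 16 bits wide.
    function CharIs_XmlChar(const Ch: Char): Boolean;
    begin
      case Ord(Ch) of
        $0009, $000A, $000D,   // TAB, LF and CR are explicitly allowed
        $0020..$D7FF,          // everything below the surrogate blocks
        $E000..$FFFD:          // rest of the BMP, excluding FFFE and FFFF
          Result := True;
      else
        Result := False;       // other control characters, surrogates, FFFE, FFFF
      end;
    end;

    begin
      Assert(CharIs_XmlChar(#13));         // the CR produced by &#xD; is accepted
      Assert(CharIs_XmlChar(#9));          // TAB is accepted
      Assert(not CharIs_XmlChar(#0));      // NUL is still rejected
      Assert(not CharIs_XmlChar(#$FFFE));  // FFFE is still rejected
      Writeln('ok');
    end.
    ```

    The case statement mirrors the ranges of the production one to one, which makes it easy to compare against the specification text.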

    The bug seems like a showstopper and should be submitted as a QC report.
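
    Until the parser is fixed, one possible workaround is to remove the offending character references before the data reaches the parser. The sketch below is an assumption-laden illustration: the helper name StripCarriageReturnRefs is hypothetical, it assumes the payload is UTF-8, and it removes every &#xD; in the document, which is harmless for base64 content (decoders generally ignore line breaks) but would alter other nodes that legitimately need a carriage return. A decimal reference such as &#13; would need the same treatment if the service emits it.

    ```pascal
    program StripCrRefSketch;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils, System.Classes;

    // Hypothetical helper: returns a new stream with all "&#xD;" references
    // removed. The caller owns the result; Source is left untouched and open.
    function StripCarriageReturnRefs(Source: TStream): TStringStream;
    var
      Reader: TStreamReader;
      Xml: string;
    begin
      // Assumption: the XML is UTF-8 encoded.
      Reader := TStreamReader.Create(Source, TEncoding.UTF8);
      try
        Xml := Reader.ReadToEnd;
      finally
        Reader.Free;
      end;
      // rfIgnoreCase also catches the lowercase spelling "&#xd;".
      Xml := StringReplace(Xml, '&#xD;', '', [rfReplaceAll, rfIgnoreCase]);
      Result := TStringStream.Create(Xml, TEncoding.UTF8);
    end;

    var
      Raw, Clean: TStringStream;
    begin
      Raw := TStringStream.Create('<data>AAAA&#xD;' + #10 + 'AAA=</data>', TEncoding.UTF8);
      try
        Clean := StripCarriageReturnRefs(Raw);
        try
          Assert(Pos('&#xD;', Clean.DataString) = 0);
          // Clean could now be passed to XMLDocument.LoadFromStream.
          Writeln('ok');
        finally
          Clean.Free;
        end;
      finally
        Raw.Free;
      end;
    end.
    ```

    Reading the whole document into a string keeps the sketch short; for large payloads a streaming filter would use less memory.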