We have problem when saving XML data ( UTF-8) encoded to DB2 9.7 LUW in table.
Table DDL:
CREATE TABLE DB2ADMIN.TABLE_FOR_XML
(
ID INTEGER NOT NULL,
XML_FIELD XML NOT NULL
)
Problem occurs in some rare examples with rare Unicode characters, we are using java jdbc db2 driver.
For example looking in editor in normal mode not in hex view (Notepad++) this strange A below (after 16.) is represented as NEL in blacks square
Input XML is in UTF-8 encoding and when looked in HEX editor has this values:
00000010h: 31 36 2E 20 C2 85 42 ; 16. Â…B
After inserting in DB2 I presume that some kind of conversion occurs because when selecting data back this same character are now
00000010h: 31 36 2E 20 0D 0A 42 ; 16. ..B
C2 85 is transformed into 0D 0A that is new line.
One another thing I noticed that although when saving XML into table header content was starting with
<xml version="1.0" encoding="UTF-8">
but after fetching xml from db2 content was starting with
<xml version="1.0" encoding="UTF-16">
Is there way to force db2 to store XML in UTF-8 without conversions ? Fetching with XMLSERIALIZE didn't help
SELECT XML_FIELD AS CONTENT1, XMLSERIALIZE(XML_FIELD as cLOB(1M)) AS CONTENT2 from DB2ADMIN.TABLE_FOR_XML
IN content2 there is no XML header but stile newLine is there.
This behaviour is standard for XML 1.1 processors. XML 1.1 s2.11:
the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating [the single character #x85] to a single #xA character
Line ending type is one of the many details of a document that will be lost over a parse-and-serialise cycle (eg attribute order, whitespace in tags, numeric character references...).
It's slightly surprising that DB2's XML fields are using XML 1.1 since not much uses that revision of XML, but not super-surprising in that support for NEL (ancient, useless mainframe line ending character) is something only IBM ever wanted.
Is there way to force db2 to store XML in UTF-8 without conversions ?
Use a BLOB?
If you need both native-XML-field functionality and to retain the exact original serialised form of a document then you'll need two columns.
(Are you sure you need to retain NEL line endings? Nobody usually cares about line endings, and these are pretty bogus.)