java · postgresql · unicode · diacritics · unicode-normalization

How to deal with decomposed unicode from external source and store it in postgresql


I am receiving XML documents in UTF-8 with decomposed diacritics.

Until now, I just read the values and saved them into my PostgreSQL database without any modification or conversion (PostgreSQL parameters: SERVER_ENCODING=UTF8, LC_COLLATE=German_Germany.1252).

Now I have discovered that when I search for a value like "Wüste" by typing it in manually (hex: 57 c3 bc 73 74 65), I am unable to find the value "Wüste" previously inserted from that external resource (hex: 57 75 cc 88 73 74 65).

I can see that the two strings differ when viewing them in the Notepad++ hex editor.
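The mismatch can be reproduced in plain Java, independent of the database. The escapes below spell out the two byte-level variants of "Wüste" (composed U+00FC vs. 'u' plus combining U+0308):

```java
import java.text.Normalizer;

public class NormalizeCheck {
    public static void main(String[] args) {
        // Composed form: single code point U+00FC (ü)
        String composed = "W\u00FCste";
        // Decomposed form: 'u' followed by U+0308 (COMBINING DIAERESIS)
        String decomposed = "Wu\u0308ste";

        // The two strings render identically but are not equal code-point-wise
        System.out.println(composed.equals(decomposed)); // false

        // After NFC normalization, the decomposed form matches the composed one
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.equals(nfc));        // true
    }
}
```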

I am trying to figure out the correct solution to this issue.

Currently I would use java.text.Normalizer to normalize the text when reading it from the external source, like this:

String normalized = Normalizer.normalize(original, Normalizer.Form.NFC);

But before doing so, I'd like to make sure I'm not missing anything. Could and should I address this issue while reading the data from the external source, for example by providing the correct encoding? Or is there another parameter I can set so that I don't need to do any normalization myself?

I am reading the data like this:

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.*;

URL url = new URL(buildSearchUrl(searchCriteria, key));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
String expression = "/s:searchRetrieveResponse/s:records/s:record/s:recordData";
DocumentBuilder db = dbf.newDocumentBuilder();
InputStream is = url.openStream();
Document dnbResultDocument = db.parse(is);
// xpath has a NamespaceContext binding the "s" prefix
XPath xpath = XPathFactory.newInstance().newXPath();
Node recordDataElem = (Node) xpath.compile(expression).evaluate(
        dnbResultDocument, XPathConstants.NODE);

I then proceed reading different nodes and getting their textual values using

element.getTextContent()

I find it kind of hard to believe that I somehow have to manually normalize every text value I get from said external resource.


Solution

  • PostgreSQL (and the underlying operating system locale libraries) doesn't really support this very well: both the composed and the decomposed form are valid UTF-8, so the server happily stores both, and equality comparisons treat them as distinct strings. It's probably best if you normalize your data before you pass it to the database, as you describe.
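One way to keep that from spreading normalization calls all over the code is to funnel every value read from the external source through a single helper before it reaches the database. This is a minimal sketch; the class and method names (`TextNormalization`, `nfc`) are hypothetical, not part of any library:

```java
import java.text.Normalizer;

public final class TextNormalization {
    private TextNormalization() {}

    // Hypothetical helper: normalize any string read from the external
    // source to NFC, so composed and decomposed variants store and
    // compare identically in the database.
    public static String nfc(String s) {
        return s == null ? null : Normalizer.normalize(s, Normalizer.Form.NFC);
    }
}
```

Then instead of calling element.getTextContent() directly, you would write TextNormalization.nfc(element.getTextContent()) at the point where values leave the DOM. NFC is idempotent, so already-composed input passes through unchanged.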