Transliterate German umlauts using icu4j into their DIN 5007-2 alternatives


I would like to be able to transliterate German umlaut characters

Ü ü ö ä Ä Ö

into their DIN 5007-2 alternatives

ä → ae
ö → oe
ü → ue
Ä → Ae
Ö → Oe
Ü → Ue
ß → ss (or SZ)

like in this case:

https://german.stackexchange.com/questions/4992/conversion-table-for-diacritics-e-g-%C3%BC-%E2%86%92-ue

The most relevant example I found is this test: https://github.com/elastic/elasticsearch-analysis-icu/blob/master/src/test/java/org/elasticsearch/index/analysis/SimpleIcuCollationTokenFilterTests.java

where, on line 208, they define collation tailorings:

String DIN5007_2_tailorings =
            "& ae , a\u0308 & AE , A\u0308"+
            "& oe , o\u0308 & OE , O\u0308"+
            "& ue , u\u0308 & UE , u\u0308";

I would like to avoid complex Java code such as defining custom tailorings and everything that goes with them. The code has to stay as simple as possible because I have to call it from inside a ColdFusion application.

I experimented a little with

var instance = Transliterator.getInstance("Latin-ASCII");

and

var instance = Transliterator.getInstance("any-NFD; [:nonspacing mark:] any-remove; any-NFC");

and variants of both. They all strip the umlaut instead of expanding it:

 writeDump(instance.transliterate('Häuser Bäume Höfe Gärten daß Ü ü ö ä Ä Ö ß '));

 Hauser Baume Hofe Garten dass U u o a A O ss 
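
The second ID explains the result: NFD splits each umlaut into a base letter plus the combining diaeresis U+0308, and the mark is then deleted, so only the base letter survives. A minimal plain-Java sketch of those same steps, assuming java.text.Normalizer as a stand-in for the ICU4J transliterator (the class and method names are hypothetical):

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of ICU4J.
final class MarkStripDemo {

    // Decompose (NFD), drop nonspacing marks (\p{Mn}), then recompose (NFC).
    static String stripMarks(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String withoutMarks = decomposed.replaceAll("\\p{Mn}+", "");
        return Normalizer.normalize(withoutMarks, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // \u00E4 is "ä", \u00F6 is "ö"
        System.out.println(stripMarks("H\u00E4user B\u00E4ume H\u00F6fe G\u00E4rten"));
        // prints: Hauser Baume Hofe Garten
    }
}
```

Nothing in this pipeline knows about German spelling rules, which is why it can never produce "ae", "oe" or "ue".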

If possible, I would like to stick to the .getInstance() method. So the question is: which ID string do I pass to .getInstance() so that umlauts are transliterated into their DIN 5007-2 equivalents?


Solution

  • Update: there is now a simple solution, the built-in "de-ASCII" ID:

    import com.ibm.icu.text.Transliterator;

    Transliterator transliterator = Transliterator.getInstance("de-ASCII");
    String umlautReplaced = transliterator.transliterate(txt); // txt is the input String
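
If the ICU4J version on the ColdFusion classpath does not recognise "de-ASCII", the mapping from the question is small enough to write by hand. The following is only a sketch under that assumption: it uses plain Java instead of ICU4J, the class and method names are hypothetical, and it covers exactly the seven characters listed in the question (with ß → ss rather than SZ).

```java
import java.text.Normalizer;

// Hypothetical fallback helper, not part of ICU4J.
final class UmlautFallback {

    // Returns the DIN 5007-2 replacement for c, or null if c is left unchanged.
    private static String replacementFor(char c) {
        switch (c) {
            case '\u00E4': return "ae"; // ä
            case '\u00F6': return "oe"; // ö
            case '\u00FC': return "ue"; // ü
            case '\u00C4': return "Ae"; // Ä
            case '\u00D6': return "Oe"; // Ö
            case '\u00DC': return "Ue"; // Ü
            case '\u00DF': return "ss"; // ß
            default:       return null;
        }
    }

    static String transliterate(String input) {
        // Compose first so that "a" + U+0308 is treated the same as a precomposed "ä".
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder out = new StringBuilder(composed.length() + 8);
        for (int i = 0; i < composed.length(); i++) {
            char c = composed.charAt(i);
            String replacement = replacementFor(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Input: "Häuser Bäume Höfe Gärten daß"
        System.out.println(transliterate("H\u00E4user B\u00E4ume H\u00F6fe G\u00E4rten da\u00DF"));
        // prints: Haeuser Baeume Hoefe Gaerten dass
    }
}
```

Unlike a transliterator ID, this leaves every other accented character untouched, so it is a narrower tool than "de-ASCII" and may not match its output for text outside these seven characters.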