
Java IDN functions not reversible?


Why are some IDNs not reversible?

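import java.net.IDN;

// The second character is the single code point U+0149.
String domain = "aŉċăwb7rňuħ.eu";
System.out.println(domain);
domain = IDN.toASCII(domain);   // convert to the ASCII-compatible encoding
System.out.println(domain);
domain = IDN.toUnicode(domain); // convert back to Unicode
System.out.println(domain);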

It displays:

aŉċăwb7rňuħ.eu
xn--anwb7ru-93a5e8ozmq2m.eu
aʼnċăwb7rňuħ.eu

As you can see, the second character has been split into two!

Thanks


Solution

  • This is by design. As far as I can tell, the second character in your string is the code point \u0149. According to the latest Unicode code charts:

    this character is deprecated and its use is strongly discouraged

    The Unicode code chart says that the deprecated code point is equivalent to \u02bc followed by \u006e.
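One way to check this equivalence from Java is to ask java.text.Normalizer for the compatibility decomposition of the character. This is only a minimal sketch: the class name CompatDecomposition and the helper decompose are made up for illustration, and it assumes that the equivalence shown in the code chart is the character's compatibility decomposition (NFKD).

```java
import java.text.Normalizer;

public class CompatDecomposition {
    // Hypothetical helper: returns the compatibility decomposition (NFKD) of the input.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String deprecated = "\u0149"; // the single deprecated code point
        String decomposed = decompose(deprecated);
        // Print each resulting char as a hexadecimal code point.
        for (char c : decomposed.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

If the assumption holds, this prints two code points, U+02BC and U+006E, rather than the single U+0149.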

    According to the javadocs, the first step that IDN.toASCII(String) performs is to process the characters of the input string with the RFC 3491 nameprep algorithm (a profile of stringprep). The RFC abstract says:

    This document describes how to prepare internationalized domain name (IDN) labels in order to increase the likelihood that name input and name comparison work in ways that make sense for typical users throughout the world. This profile of the stringprep protocol is used as part of a suite of on-the-wire protocols for internationalizing the Domain Name System (DNS).

    (In other words, stringprep is designed to make it harder to create tricky domain names that look like one thing and mean something different.)

    In fact, if you drill down, you will find that the mapping prescribed for \u0149 in the stringprep tables (RFC 3454) is \u02bc \u006e; i.e. the equivalent defined in the Unicode code charts.

    And ... that is what is happening.
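The mapping is easier to see when the strings are printed as escapes rather than as glyphs, since the two forms look almost identical on screen. The following is a sketch rather than a definitive test: the class IdnRoundTrip and its helpers roundTrip and escape are hypothetical names, and only IDN.toASCII and IDN.toUnicode come from the JDK.

```java
import java.net.IDN;

public class IdnRoundTrip {
    // Hypothetical helper: applies toASCII followed by toUnicode, as in the question.
    static String roundTrip(String domain) {
        return IDN.toUnicode(IDN.toASCII(domain));
    }

    // Hypothetical helper: renders every char as a four-digit hex escape
    // so that visually similar strings can be told apart.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same domain as in the question, with U+0149 written as an escape.
        String original = "a\u0149ċăwb7rňuħ.eu";
        String result = roundTrip(original);

        System.out.println(escape(original));
        System.out.println(escape(result));

        // A second round trip should leave the already-normalized form unchanged.
        System.out.println(roundTrip(result).equals(result));
    }
}
```

The first escaped line contains 0149, while the second should contain 02bc followed by 006e in its place.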


    Summary

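    1. Your expectation that you can round-trip IDNs is ill-founded: toASCII normalizes its input, so toUnicode gives back the normalized form, which is not necessarily the original string.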
    2. You shouldn't be using that character anyway, since it is deprecated. (Certainly, it is a bad idea to use it in an IDN!)
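If the goal is to decide whether two spellings refer to the same domain, one option is to compare their ASCII forms instead of expecting the Unicode string to survive a round trip. This is a suggestion under stated assumptions, not something the javadocs prescribe: the class IdnCompare and the method sameDomain are hypothetical, and the sketch assumes both inputs are acceptable to IDN.toASCII (which throws IllegalArgumentException otherwise).

```java
import java.net.IDN;

public class IdnCompare {
    // Hypothetical helper: two names are treated as the same domain
    // if their ASCII-compatible encodings match.
    // DNS names compare case-insensitively, hence equalsIgnoreCase.
    static boolean sameDomain(String a, String b) {
        return IDN.toASCII(a).equalsIgnoreCase(IDN.toASCII(b));
    }

    public static void main(String[] args) {
        String withDeprecated = "a\u0149b.eu";  // uses U+0149
        String withMapped = "a\u02bcnb.eu";     // uses U+02BC followed by n

        System.out.println(sameDomain(withDeprecated, withMapped));
        System.out.println(sameDomain(withDeprecated, "anb.eu"));
    }
}
```

Because nameprep maps both spellings to the same label, the first comparison should succeed, while the plain ASCII name in the second one is a different domain.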