The Unicode character U+FA8E (CJK COMPATIBILITY IDEOGRAPH-FA8E) is a compatibility character mapped to U+641C (in the CJK Unified Ideographs block). In Java 6, NFC normalization leaves it as U+FA8E, while in Java 7 it is converted to U+641C. Why?
When running this small snippet:
import java.text.Normalizer;

String fancyChar = "\uFA8E";
String normalized = Normalizer.normalize(fancyChar, Normalizer.Form.NFC);
System.out.printf("%04x == %04x\n", (int)(fancyChar.charAt(0)), (int)(normalized.charAt(0)));
System.out.println(fancyChar.equals(normalized));
In Java 6 (latest versions of both Sun/Oracle and OpenJDK):
fa8e == fa8e
true
In Java 7 (latest versions of both Sun/Oracle and OpenJDK):
fa8e == 641c
false
So my question is, why has this changed?
From reading Unicode Normalization Forms (UAX #15), it seems that NFC should not decompose characters that have only a compatibility mapping.
But the fact that both Oracle and OpenJDK have switched this for Java 7 makes me wonder.
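The character U+FA8E has a canonical mapping to U+641C, not a compatibility mapping, despite its name. The authoritative reference on this is the UnicodeData.txt file in the Unicode Character Database. NFC applies canonical decomposition first, and a single-character canonical mapping like this one is never recomposed. Thus, the correct NFC form of U+FA8E is U+641C.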
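One way to check from Java whether a mapping is canonical is to apply NFD, which performs canonical decompositions only: if NFD alone changes the character, the mapping cannot be a pure compatibility mapping. The following is a minimal sketch, assuming a Java 7 or later runtime. The class and method names are hypothetical, and U+FB01 (LATIN SMALL LIGATURE FI) is included only as a contrast, because it has a compatibility mapping and no canonical one.

```java
import java.text.Normalizer;

// Hypothetical helper for illustration; not part of any library.
class CanonicalMappingCheck {

    // Returns true if canonical decomposition (NFD) alone changes the input.
    static boolean hasCanonicalDecomposition(String s) {
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        String fancyChar = "\uFA8E";
        String nfd = Normalizer.normalize(fancyChar, Normalizer.Form.NFD);

        // Show what NFD does to the character from the question.
        System.out.printf("NFD: %04x -> %04x%n",
                (int) fancyChar.charAt(0), (int) nfd.charAt(0));
        System.out.println("U+FA8E has canonical decomposition: "
                + hasCanonicalDecomposition(fancyChar));

        // Contrast: a character with a compatibility mapping only.
        System.out.println("U+FB01 has canonical decomposition: "
                + hasCanonicalDecomposition("\uFB01"));
    }
}
```

On a Java 6 runtime the result for U+FA8E would presumably differ, in line with the NFC behaviour shown in the question.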
So this is apparently a bug fix. It seems to affect other characters in the same group, too.
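To see which other characters are affected on a given runtime, one could scan the block and list every code point whose NFC form differs from the input. This is a rough sketch, assuming that "the same group" means the CJK Compatibility Ideographs block (U+F900..U+FAFF). The class and method names are hypothetical, and the output depends on the Unicode version that the Java runtime supports.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner for illustration; not part of any library.
class CompatibilityIdeographScan {

    // Collects assigned code points in [first, last] that NFC changes.
    static List<Integer> changedByNfc(int first, int last) {
        List<Integer> changed = new ArrayList<Integer>();
        for (int cp = first; cp <= last; cp++) {
            if (!Character.isDefined(cp)) {
                continue; // skip code points unassigned on this runtime
            }
            String s = new String(Character.toChars(cp));
            if (!Normalizer.normalize(s, Normalizer.Form.NFC).equals(s)) {
                changed.add(cp);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        // U+F900..U+FAFF is the CJK Compatibility Ideographs block.
        for (int cp : changedByNfc(0xF900, 0xFAFF)) {
            String s = new String(Character.toChars(cp));
            String nfc = Normalizer.normalize(s, Normalizer.Form.NFC);
            // codePointAt handles targets outside the Basic Multilingual Plane.
            System.out.printf("U+%04X -> U+%04X%n", cp, nfc.codePointAt(0));
        }
    }
}
```

Running the same class under Java 6 and Java 7 and comparing the two listings would show exactly which characters changed behaviour between the releases.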