I'm writing some Java code that deals with Chinese characters, and I got some unexpected results -- strings that should be equal were not. Here is one of the offending characters, which means "six" (pinyin: liù): 六. This character can be represented with either of two code points:
U+F9D1, in the CJK Compatibility Ideographs block
U+516D, in the CJK Unified Ideographs block
Wikipedia has a page about these character ranges, and the short section on compatibility ideographs does mention some duplicates, but the list omits this specific character.
So I'm wondering:
Just normalize them. U+F9D1 has a canonical singleton decomposition to U+516D, so it becomes U+516D under any of the four normalization forms (NFD, NFC, NFKD, NFKC):
$ export PERL_UNICODE=S
$ perl -le 'print "\x{F9D1}\x{516D}"' | uniquote -v
\N{CJK COMPATIBILITY IDEOGRAPH-F9D1}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfd | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfc | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfkd | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
$ perl -le 'print "\x{F9D1}\x{516D}"' | nfkc | uniquote -v
\N{CJK UNIFIED IDEOGRAPH-516D}\N{CJK UNIFIED IDEOGRAPH-516D}
Many essential Unicode tools, including those, are available here.
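Since the question is about Java, the same normalization can be sketched with the standard library's java.text.Normalizer. This is a minimal illustration rather than a drop-in fix: the class name NormalizeCjk and the helpers canonical and canonicallyEqual are hypothetical names chosen for this example, and NFC is an assumed choice (any of the four forms maps U+F9D1 to U+516D).

```java
import java.text.Normalizer;

// Hypothetical example class; the names below are not from any library.
class NormalizeCjk {

    // Returns the NFC form of the input. NFC is an assumption here;
    // for U+F9D1 all four forms give the same result.
    static String canonical(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Compares two strings after normalizing both sides.
    static boolean canonicallyEqual(String a, String b) {
        return canonical(a).equals(canonical(b));
    }

    public static void main(String[] args) {
        String compat = "\uF9D1";   // CJK COMPATIBILITY IDEOGRAPH-F9D1
        String unified = "\u516D";  // CJK UNIFIED IDEOGRAPH-516D

        // Raw comparison looks at code units, so the two differ.
        System.out.println(compat.equals(unified));            // false
        // After normalization both are U+516D.
        System.out.println(canonicallyEqual(compat, unified)); // true

        // Show the result of each normalization form on U+F9D1.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String out = Normalizer.normalize(compat, form);
            System.out.printf("%s -> U+%04X%n", form, out.codePointAt(0));
        }
    }
}
```

Normalizing once at the boundary where text enters the program (file reads, user input, database loads) is usually simpler than normalizing at every comparison, because String.equals, hashCode, and map lookups then agree without extra work.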