Anything odd about Chinese unicode characters 稍 and 稊 that would affect KDiff3?

I have reported a bug and entered a support request at the KDiff3 site (https://sourceforge.net/p/kdiff3/bugs/198/), but I wonder if anyone can quickly tell me whether there's anything unusual about these unicode characters that might help me understand why such a bug exists.

When I merge two identical files containing the character 稍 using KDiff3 version 0.9.98, it reads the character as 稊 and shows that character in all the panes of the merge. The output then contains that character instead of 稍.

I've observed this behavior with UCS-2 Little Endian encoding in version 0.9.98 of KDiff3, but not with UTF-8 encoding, and not with ~~version 0.9.96a~~ the version of KDiff3 that comes with TortoiseHg. I can reproduce the problem in 0.9.96 and 0.9.97, but TortoiseHg's bundled KDiff3, which reports itself as version 0.9.96a, does not exhibit it.

Edit: I vaguely suspect the source of the problem to be somewhere in the Qt library. So any information about what Qt does especially in regard to handling international text might be useful.

Solution

Utilities that process text files need to break the text into characters to operate effectively. The simplest possible process is to treat each 8-bit byte as a single character. Unfortunately this doesn't work well with UTF-16 or UCS-2 input, since each byte is only half of the character.

The character you're having problems with is 稍 (U+7a0d), which is being converted to 稊 (U+7a0a). When you break those down into little-endian bytes, you get 0x0d, 0x7a and 0x0a, 0x7a respectively. The byte 0x0d is the ASCII code for Return, and 0x0a is the code for Linefeed. It seems that KDiff3 is interpreting these bytes as line endings, and substituting a Linefeed when it encounters a Return. This is consistent with your report of an error message indicating inconsistent line endings in the file.
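The byte arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the suspected mechanism, not KDiff3's actual code: the byte-level `replace` call is a stand-in (an assumption) for whatever line-ending handling KDiff3 performs, and the variable names are made up for illustration.

```python
# Encode the problem character as UTF-16 little-endian (UCS-2 LE for BMP characters).
original = "\u7a0d"  # 稍
raw = original.encode("utf-16-le")
print(raw.hex())  # the two bytes, low byte first

# Hypothetical byte-level line-ending "normalization": every 0x0d becomes 0x0a.
# This assumes the tool looks at single bytes rather than 16-bit code units.
mangled = raw.replace(b"\r", b"\n")

# Decoding the altered bytes yields a different character.
result = mangled.decode("utf-16-le")
print(f"U+{ord(original):04x} -> U+{ord(result):04x}")
```

A tool that scanned for line endings in 16-bit units instead would look for the pair 0x0d, 0x00, which never matches half of 稍.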

When working with Unicode it is often better to use UTF-8 encoding. The characters above U+007f will still take up more than one byte, but each of those bytes will have a value of 0x80 or greater and cannot accidentally be mistaken for one of the ASCII characters. For example 稍 becomes 0xe7, 0xa8, 0x8d.
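To check the UTF-8 claim, here is a small Python sketch. It applies the same hypothetical byte-level Return-to-Linefeed substitution (an assumption about the kind of processing involved, not KDiff3's real code) and shows that it leaves the UTF-8 bytes alone.

```python
# Encode the same character as UTF-8.
original = "\u7a0d"  # 稍
utf8 = original.encode("utf-8")
print(utf8.hex())

# Every byte of a multi-byte UTF-8 sequence is 0x80 or greater,
# so none of them can collide with ASCII control codes such as 0x0d or 0x0a.
high_bytes_only = all(b >= 0x80 for b in utf8)

# The hypothetical byte-level substitution finds nothing to replace.
untouched = utf8.replace(b"\r", b"\n")
round_trip = untouched.decode("utf-8")
print(high_bytes_only, round_trip == original)
```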