
How does the handling of combining characters in the Unicode Collation Algorithm work?


I maintain an open-source, pure-Python implementation of the Unicode Collation Algorithm called pyuca.

While it meets my needs in sorting Ancient Greek text (and seems to meet the needs of many other people), I'm looking to improve its coverage of rarer cases by getting it to the point where it passes the entire suite of official conformance tests.

However, 1,869 of the tests (just over 1%) fail. The first failure is at 0332 0334, for which the test files give the sort key | 004A 0021 | 0002 0002 |.

pyuca, however, forms the sort key | 0021 004A | 0002 0002 |.

At first I thought this might be due to a lack of support for non-starter characters (S2.1.1 through S2.1.3 of the algorithm in the latest spec). However, implementing that part did nothing to change the sort key. Working through the algorithm by hand on paper also fails to trigger that section, which has me wondering whether I'm just missing something.

The relevant steps in the algorithm are:

S2.1.1 If there are any non-starters following S, process each non-starter C.
S2.1.2 If C is not blocked from S, find if S + C has a match in the table.
S2.1.3 If there is a match, replace S by S + C, and remove C.
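
As a rough illustration of these three steps (a sketch only, not pyuca's actual code), the function below uses a hypothetical `table` set standing in for the keys of the collation element table. `extend_with_nonstarters` is an invented name, and "blocked" is taken to mean that some unmatched character between S and C is a starter or has a combining class greater than or equal to that of C.

```python
import unicodedata


def extend_with_nonstarters(S, following, table):
    """Sketch of S2.1.1-S2.1.3.

    S         -- the match found so far (a string)
    following -- the characters that come after S
    table     -- hypothetical set of strings that have table entries
    Returns (possibly extended S, characters left unconsumed).
    """
    skipped = []  # unmatched non-starters that now sit between S and C
    for i, C in enumerate(following):
        ccc = unicodedata.combining(C)
        if ccc == 0:
            # A starter ends the run of non-starters and blocks the rest.
            skipped.extend(following[i:])
            break
        # C is blocked if a skipped character has ccc >= ccc(C).
        blocked = any(unicodedata.combining(B) >= ccc for B in skipped)
        if not blocked and S + C in table:
            S = S + C          # S2.1.3: replace S by S + C and remove C
        else:
            skipped.append(C)  # C stays in the string
    return S, "".join(skipped)


# Made-up table: U+0334 (ccc 1) is skipped, then U+0332 (ccc 220) matches.
print(extend_with_nonstarters("a", "\u0334\u0332", {"a", "a\u0332"}))

# The failing case: no contraction for 0332 + 0334, so nothing changes.
print(extend_with_nonstarters("\u0332", "\u0334", {"\u0332", "\u0334"}))
```

With a table containing only the single characters, the second call returns its input unchanged, which is the behaviour described in the question.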

The key phrase is "If there is a match". In the test mentioned above that fails, there is no match for 0332 0334 and so this part of the algorithm cannot explain why the sort key should be in a different order to what my implementation produces.

Can anyone explain what part of the UCA would form a sort key like the test file suggests?


Solution

  • Does it work better if you shove the string into Normalization Form D first? (Step 1.)

    This is an utter wild guess based on the fact that 0332 0334 is not in NFD. I haven't tried to work through the algorithm at all.
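
A quick way to check this guess is with Python's standard `unicodedata` module. The snippet below is only a sketch; `codepoints` is a helper name made up for display purposes. It normalizes the failing input to NFD and prints the canonical combining class of each character.

```python
import unicodedata


def codepoints(text):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


s = "\u0332\u0334"  # the first failing test case: 0332 0334
nfd = unicodedata.normalize("NFD", s)

# Canonical combining class of each character in the original string.
ccc = {f"{ord(ch):04X}": unicodedata.combining(ch) for ch in s}

print(codepoints(s))    # 0332 0334
print(codepoints(nfd))  # 0334 0332
print(ccc)              # {'0332': 220, '0334': 1}
```

Canonical ordering sorts adjacent non-starters by combining class, so 0334 (class 1) moves ahead of 0332 (class 220). If pyuca maps 0332 to 0021 and 0334 to 004A, as its output for the unnormalized string suggests, then building the key from the NFD form would give | 004A 0021 | 0002 0002 |, the order the test file expects.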