I intend to normalize to Form C (NFC), then split the text into "display units": basically a base character plus all the combining characters that follow it. For now, I'm only looking to handle the Latin-based scripts.
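Roughly, the splitting step I have in mind looks like the sketch below. The names `display_units` and `is_combining` are just placeholders of mine, and the default predicate (general category starting with "M") is only an assumption standing in for whatever test turns out to be correct:

```python
import unicodedata

def is_combining(ch):
    # Placeholder predicate (an assumption): treat the Unicode "Mark"
    # general categories Mn, Mc and Me as combining.
    return unicodedata.category(ch).startswith("M")

def display_units(text, predicate=is_combining):
    """Normalize to NFC, then group each base character with the marks that follow it."""
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and predicate(ch):
            units[-1] += ch   # attach the mark to the previous unit
        else:
            units.append(ch)  # start a new unit
    return units
```

With this, "e" followed by 301 collapses into the single precomposed character under NFC, while a base letter with no precomposed form keeps its marks as separate code points inside one unit.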
To determine whether a code point is a combining character, is it enough to check that it falls within the ranges below?
Arabic, Hebrew and various Indic scripts are pending...
These are all the ranges of Unicode code points (in hexadecimal) whose names contain the word 'COMBINING' (e.g. 301 COMBINING ACUTE ACCENT):
300-36F
483-489
7EB-7F3
135F-135F
1A7F-1A7F
1B6B-1B73
1DC0-1DE6
1DFD-1DFF
20D0-20F0
2CEF-2CF1
2DE0-2DFF
3099-309A
A66F-A672
A67C-A67D
A6F0-A6F1
A8E0-A8F1
FE20-FE26
101FD-101FD
1D165-1D169
1D16D-1D172
1D17B-1D182
1D185-1D18B
1D1AA-1D1AD
1D242-1D244
I compiled this list with a Python script, making use of the unicodedata module.
I don't know what version of Unicode this is exactly, but I think it's reasonably up to date.
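For reference, a script along these lines should reproduce the list (a reconstruction in Python 3, not necessarily the exact script I ran; `combining_ranges` is a name made up for this sketch). The `unicodedata.unidata_version` attribute should also report which Unicode version the module is built against:

```python
import sys
import unicodedata

def combining_ranges():
    """Return inclusive (start, end) ranges of code points whose name contains 'COMBINING'."""
    ranges = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed or unassigned code points fall back to an empty name.
        name = unicodedata.name(chr(cp), "")
        if "COMBINING" not in name:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges

print("Unicode version:", unicodedata.unidata_version)
for start, end in combining_ranges():
    print(f"{start:X}-{end:X}")
```

The exact ranges printed will depend on the Unicode version bundled with the Python build, so a newer interpreter may list more ranges than the ones above.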
However, I don't know whether it is enough to handle only the characters that are 'combining' in the strict sense, since Unicode also has 'modifier letters' and the like.
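One way to probe this might be to compare what unicodedata reports for a few sample characters. The helper `describe` is hypothetical, and the samples are just ones I picked: a combining accent, a modifier letter, a Hebrew point whose name does not contain 'COMBINING', and an enclosing mark:

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for one character."""
    return (
        unicodedata.name(ch, "<unnamed>"),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

# 301 COMBINING ACUTE ACCENT, 2B0 MODIFIER LETTER SMALL H,
# 5B0 HEBREW POINT SHEVA, 20DD COMBINING ENCLOSING CIRCLE
for ch in ["\u0301", "\u02B0", "\u05B0", "\u20DD"]:
    name, category, ccc = describe(ch)
    print(f"{ord(ch):X} {name}: category={category}, combining class={ccc}")
```

If I read the output correctly, the modifier letter is reported as category Lm (a letter, not a mark), while the Hebrew point is category Mn even though its name lacks the word 'COMBINING', which suggests a name-based list may miss marks once I move beyond Latin.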