Algorithm to check for combining characters in Unicode


I intend to normalize text to Normalization Form C (NFC) and then divide it into "display units": basically a base character plus all the combining characters that follow it. For now, I'm only looking to handle the Latin-based scripts.
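
A minimal sketch of that plan, assuming Python and its standard unicodedata module. The names display_units and is_combining are hypothetical, and the combining test is passed in as a parameter because which test to use is exactly the open question; the default here (a nonzero canonical combining class) is only a placeholder and misses marks whose combining class is 0, such as enclosing marks.

```python
import unicodedata


def is_combining(ch):
    # Placeholder test (an assumption, not a settled answer): a nonzero
    # canonical combining class. Marks with class 0 are not caught.
    return unicodedata.combining(ch) != 0


def display_units(text, is_combining=is_combining):
    """Normalize to NFC, then group each base character with the
    combining characters that directly follow it."""
    units = []
    for ch in unicodedata.normalize("NFC", text):
        if units and is_combining(ch):
            units[-1] += ch   # attach to the preceding base character
        else:
            units.append(ch)  # start a new display unit
    return units
```

For example, display_units("e\u0301") yields a single unit containing the precomposed U+00E9, while "q\u0301" stays as two code points in one unit, because NFC has no precomposed form for it.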

To determine if a code point is a combining character, is it enough to check that it is within these ranges?

  • Combining Diacritical Marks (0300–036F)
  • Combining Diacritical Marks Supplement (1DC0–1DFF)
  • Combining Diacritical Marks for Symbols (20D0–20FF)
  • Combining Half Marks (FE20–FE2F)

Arabic, Hebrew and various Indian scripts pending...
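
The range check described above could be sketched like this in Python. COMBINING_BLOCKS and in_combining_blocks are hypothetical names; the ranges are the four blocks listed in the question, taken as inclusive bounds.

```python
# The four blocks from the question, as inclusive (first, last) pairs.
COMBINING_BLOCKS = [
    (0x0300, 0x036F),  # Combining Diacritical Marks
    (0x1DC0, 0x1DFF),  # Combining Diacritical Marks Supplement
    (0x20D0, 0x20FF),  # Combining Diacritical Marks for Symbols
    (0xFE20, 0xFE2F),  # Combining Half Marks
]


def in_combining_blocks(ch):
    """Return True if ch falls inside one of the four blocks above."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in COMBINING_BLOCKS)
```

A block test like this is a whole-range check: it accepts every code point in a block, assigned or not, and it rejects combining characters that live in other blocks, such as U+0483 COMBINING CYRILLIC TITLO.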


Solution

  • These are all the ranges of Unicode code points whose names contain the word 'combining' (e.g. U+0301 COMBINING ACUTE ACCENT):

    300-36F
    483-489
    7EB-7F3
    135F-135F
    1A7F-1A7F
    1B6B-1B73
    1DC0-1DE6
    1DFD-1DFF
    20D0-20F0
    2CEF-2CF1
    2DE0-2DFF
    3099-309A
    A66F-A672
    A67C-A67D
    A6F0-A6F1
    A8E0-A8F1
    FE20-FE26
    101FD-101FD
    1D165-1D169
    1D16D-1D172
    1D17B-1D182
    1D185-1D18B
    1D1AA-1D1AD
    1D242-1D244

    I compiled this list with a Python script, making use of the unicodedata module. I don't know what version of Unicode this is exactly, but I think it's reasonably up to date.

    However, I don't know whether covering the characters that are 'combining' in the strict sense is enough for your purpose, since Unicode also has 'modifier letters' and the like.
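
The original script is not shown, so the following is only a sketch of how such a list might be compiled with unicodedata; combining_name_ranges and is_mark are hypothetical names. It also prints unicodedata.unidata_version, which reports the Unicode version the module was built with, and adds a general-category test (Mn, Mc, Me) as one possible alternative to matching on names.

```python
import sys
import unicodedata


def combining_name_ranges():
    """Return inclusive (first, last) ranges of code points whose
    character name contains the word 'COMBINING'."""
    ranges = []
    for cp in range(sys.maxunicode + 1):
        # Unnamed code points (unassigned, surrogates, ...) yield "".
        name = unicodedata.name(chr(cp), "")
        if "COMBINING" not in name:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    return ranges


def is_mark(ch):
    # Mn, Mc and Me are the general categories for combining marks.
    return unicodedata.category(ch).startswith("M")


if __name__ == "__main__":
    print("Unicode", unicodedata.unidata_version)
    for first, last in combining_name_ranges():
        print(f"{first:X}-{last:X}")
```

The two tests do not agree everywhere. U+05B0 HEBREW POINT SHEVA has category Mn but no 'COMBINING' in its name, so a name-based list misses it, while modifier letters such as U+02B0 MODIFIER LETTER SMALL H have category Lm and are not marks at all. The exact ranges printed depend on the Unicode version of the Python build.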