I want to check if a string is already in NFC form. Currently I do:
unicodedata.normalize('NFC', s) == s
I am doing this for a large number of strings, so I would like it to be efficient. The above method seems wasteful: it converts the string to NFC and then does a full string comparison.
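In full, what I do now is roughly this (is_nfc is just an illustrative name):

    import unicodedata

    def is_nfc(s: str) -> bool:
        # Normalise to NFC, then compare against the original string.
        return unicodedata.normalize('NFC', s) == s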
Is there a more efficient way to do it? I have considered:
len(unicodedata.normalize('NFC', s)) == len(s)
This avoids the string comparison. But I am not sure it is always correct: it works only if NFC normalization always changes the length of a non-NFC string. Is that a valid assumption?
Any other ideas?
Normalising doesn't necessarily change the length of a string. For example, 'Ω' (U+2126 OHM SIGN) becomes 'Ω' (U+03A9 GREEK CAPITAL LETTER OMEGA) after NFC: a different code point, but the same length, so your length check would wrongly report that string as already NFC.
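You can see this with a quick test (the two code points render identically but compare unequal):

    import unicodedata

    ohm = '\u2126'                           # OHM SIGN
    nfc = unicodedata.normalize('NFC', ohm)  # GREEK CAPITAL LETTER OMEGA
    print(nfc == '\u03a9')     # True: a different code point came out
    print(len(ohm), len(nfc))  # 1 1 -- same length, so the length test misses it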
There is a normalisation "quick check" property in the Unicode Character Database for testing whether a string is already normalised, but unfortunately Python's unicodedata module doesn't expose it directly (unicodedata.is_normalized() only arrived in Python 3.8). However, unicodedata.normalize() does use this property internally to avoid doing any extra work when the string is already normalised: in that case it simply returns the input string.
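A rough illustration (note that the identity short-circuit is a CPython implementation detail, not a documented guarantee):

    import unicodedata

    s = 'caf\u00e9'  # precomposed é: already in NFC

    # CPython's normalize() short-circuits via the quick check and
    # returns the input object untouched when there is nothing to do:
    print(unicodedata.normalize('NFC', s) is s)  # True on CPython

    # Python 3.8+ exposes the quick check directly:
    print(unicodedata.is_normalized('NFC', s))   # True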
To access this property on older Python versions, you will either need to compile a table yourself from the Unicode Character Database or use a broader Unicode library with Python bindings (like PyICU).
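With PyICU that looks roughly like this (a sketch assuming a reasonably recent PyICU that exposes the Normalizer2 API):

    from icu import Normalizer2

    nfc = Normalizer2.getNFCInstance()

    print(nfc.isNormalized('caf\u00e9'))   # True: precomposed é
    print(nfc.isNormalized('cafe\u0301'))  # False: e + combining acute accent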