Tags: unicode, utf-8, rust, normalization, unicode-normalization

How to detect Unicode characters that are not normalized?


Given a UTF-8 string (&str), I want to find any ranges of characters that are not normalized (e.g. a\u{300} instead of \u{e0}).

How do I do this?

Edit: Thanks to DK for correcting my faulty UTF-8 sequence. The combining character comes after the a, not before.


Solution

  • Edit: I just realised that the reason for the results I was getting is that your example string is backwards. The combining code point should come second, not first. I've updated the answer accordingly.

    Well, that depends on the definition of "normalized".

    For example:

    /*!
    Add this to a `Cargo.toml` manifest:
    
    ```cargo
    [dependencies]
    unicode-normalization = "0.1.1"
    ```
    */
    extern crate unicode_normalization;
    
    fn main() {
        // One decomposed ("a" + combining grave accent) and one precomposed ("à") test string.
        for test_str in vec!["a\u{300}", "\u{e0}"] {
            is_nfd(test_str);
            is_nfkd(test_str);
            is_nfc(test_str);
            is_nfkc(test_str);
        }
    }
    
    // Generates an `is_*` checker that compares the input string against its
    // stream in the given normalization form.
    macro_rules! norm_test {
        ($fn_name:ident, $norm_name:ident) => {
            fn $fn_name(s: &str) {
                use unicode_normalization::UnicodeNormalization;
                println!("is_{}({:?}):", stringify!($norm_name), s);
                // Pair each original char with the corresponding normalized char,
                // print the pair, and require that every pair matches.
                let is_norm = s.chars().zip(s.$norm_name())
                    .inspect(|&(a, b)| println!(" - ({:x}, {:x})", a as u32, b as u32))
                    .all(|(a, b)| a == b);
                println!(" is_norm: {}", is_norm);
            }
        };
    }
    
    norm_test! { is_nfd, nfd }
    norm_test! { is_nfkd, nfkd }
    norm_test! { is_nfc, nfc }
    norm_test! { is_nfkc, nfkc }
    

    This produces the following output:

    is_nfd("a\u{300}"):
     - (61, 61)
     - (300, 300)
     is_norm: true
    is_nfkd("a\u{300}"):
     - (61, 61)
     - (300, 300)
     is_norm: true
    is_nfc("a\u{300}"):
     - (61, e0)
     is_norm: false
    is_nfkc("a\u{300}"):
     - (61, e0)
     is_norm: false
    is_nfd("\u{e0}"):
     - (e0, 61)
     is_norm: false
    is_nfkd("\u{e0}"):
     - (e0, 61)
     is_norm: false
    is_nfc("\u{e0}"):
     - (e0, e0)
     is_norm: true
    is_nfkc("\u{e0}"):
     - (e0, e0)
     is_norm: true
    

    So "a\u{300}" is NFD and NFKD, whilst "\u{e0}" is NFC and NFKC. I don't know of any examples which differ between the K and non-K variants, though the Unicode FAQ on Normalization will probably explain things better than I can.