Tags: unicode, utf-8, rust, normalization, unicode-normalization

How to detect Unicode characters that are not normalized?


Given a UTF-8 string (&str), I want to find any ranges of characters that are not normalized (e.g. a\u{300} instead of \u{e0}).

How do I do this?

Edit: Thanks to DK for correcting my faulty UTF-8 sequence. The combining character comes after the a, not before.


Solution

  • Edit: I just realised that the reason for the results I was getting is that your example string is backwards. The combining code point should come second, not first. I've updated the answer accordingly.

    Well, that depends on the definition of "normalized".

    For example:

    /*!
    Add this to a `Cargo.toml` manifest:
    
    ```cargo
    [dependencies]
    unicode-normalization = "0.1.1"
    ```
    */
    extern crate unicode_normalization;
    
    fn main() {
        // One decomposed ("a" + combining grave accent) and one precomposed ("à") test string.
        for test_str in vec!["a\u{300}", "\u{e0}"] {
            is_nfd(test_str);
            is_nfkd(test_str);
            is_nfc(test_str);
            is_nfkc(test_str);
        }
    }
    
    // Generates an `is_*` checker that compares the input string against its
    // stream in the given normalization form.
    macro_rules! norm_test {
        ($fn_name:ident, $norm_name:ident) => {
            fn $fn_name(s: &str) {
                use unicode_normalization::UnicodeNormalization;
                println!("is_{}({:?}):", stringify!($norm_name), s);
                // Pair each original char with the corresponding normalized char,
                // print the pair, and require that every pair matches.
                let is_norm = s.chars().zip(s.$norm_name())
                    .inspect(|&(a, b)| println!(" - ({:x}, {:x})", a as u32, b as u32))
                    .all(|(a, b)| a == b);
                println!(" is_norm: {}", is_norm);
            }
        };
    }
    
    norm_test! { is_nfd, nfd }
    norm_test! { is_nfkd, nfkd }
    norm_test! { is_nfc, nfc }
    norm_test! { is_nfkc, nfkc }
    

    This produces the following output:

    is_nfd("a\u{300}"):
     - (61, 61)
     - (300, 300)
     is_norm: true
    is_nfkd("a\u{300}"):
     - (61, 61)
     - (300, 300)
     is_norm: true
    is_nfc("a\u{300}"):
     - (61, e0)
     is_norm: false
    is_nfkc("a\u{300}"):
     - (61, e0)
     is_norm: false
    is_nfd("\u{e0}"):
     - (e0, 61)
     is_norm: false
    is_nfkd("\u{e0}"):
     - (e0, 61)
     is_norm: false
    is_nfc("\u{e0}"):
     - (e0, e0)
     is_norm: true
    is_nfkc("\u{e0}"):
     - (e0, e0)
     is_norm: true
    

    So "a\u{300}" is NFD and NFKD, whilst "\u{e0}" is NFC and NFKC. I don't know of any examples which differ between the K and non-K variants, though the Unicode FAQ on Normalization will probably explain things better than I can.