rust, unicode, nfd

What is the standard method in Rust to decompose a Unicode character (remove accents and other marks)?


What is the correct way in Rust to take a character such as ἄ and return the plain α without accents? (For example, by doing a Unicode NFC-to-NFD conversion so that the α is separated from the ᾽ and ´.)

I suspect this Rust documentation page describes the function I need, but it has no example code to show me how to use it.

https://docs.rs/unicode-normalization/latest/unicode_normalization/char/fn.decompose_canonical.html

I know this rather hacky, convoluted code works, but it does not seem like the right approach:

use unicode_normalization::UnicodeNormalization; // provides .nfkd() on &str

let s: Vec<char> = c.to_string().nfkd().collect();
s[0] // <--- unaccented

Solution

  • The closure you pass to decompose_canonical is called once for each character in the decomposition, in order. The first character it is called with is the base character you are interested in. Example code:

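    use unicode_normalization::char::decompose_canonical;

    fn main() {
        // Holds the first character emitted by the decomposition.
        let mut base_char = None;
        // get_or_insert only stores a value while base_char is still None,
        // so the combining marks that follow the base are ignored.
        decompose_canonical('ἄ', |c| { base_char.get_or_insert(c); });
        dbg!(base_char); // the value is Some('α')
    }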
    

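To make it easier to see why keeping only the first emitted character works, here is a minimal sketch of the same callback pattern wrapped in a reusable function. It is not the crate's implementation: `toy_decompose_canonical` is a hypothetical stand-in with a hardcoded entry for ἄ (U+1F04) so the sketch runs without any dependencies, and `base_char` is an illustrative helper name. With the crate available, the call to the toy function would presumably be swapped for `unicode_normalization::char::decompose_canonical`, which takes the same kind of `FnMut(char)` callback.

```rust
// Hypothetical stand-in for unicode_normalization::char::decompose_canonical.
// Assumption: it only knows the one character from the question; the real
// crate covers all of Unicode.
fn toy_decompose_canonical<F: FnMut(char)>(c: char, mut emit_char: F) {
    match c {
        // U+1F04 decomposes to: alpha, combining comma above (psili),
        // combining acute accent (oxia). The base character comes first.
        '\u{1F04}' => {
            emit_char('\u{03B1}');
            emit_char('\u{0313}');
            emit_char('\u{0301}');
        }
        // Anything this toy table does not know is emitted unchanged.
        other => emit_char(other),
    }
}

/// Returns the first character emitted by the decomposition,
/// which is the base character without its combining marks.
fn base_char(c: char) -> char {
    let mut first = None;
    toy_decompose_canonical(c, |d| {
        // Stores `d` only while `first` is still None.
        first.get_or_insert(d);
    });
    // The callback is called at least once here; the fallback keeps the
    // function total without an unwrap.
    first.unwrap_or(c)
}

fn main() {
    println!("{}", base_char('\u{1F04}')); // prints α
    println!("{}", base_char('x')); // prints x
}
```

Returning a plain `char` rather than an `Option<char>` is a design choice for this sketch: a character with no decomposition is simply returned as-is.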