Tags: parsing, utf-8, rust, nom

Rust - How to parse UTF-8 alphabetical characters in nom?


I am trying to parse sequences of alphabetical characters, including German umlauts (ä ö ü) and other alphabetical characters from the UTF-8 charset. This is the parser I tried first:

named!(
    parse(&'a str) -> Self,
    map!(
        alpha1,
        |s| Self { chars: s.into() }
    )
);

But it only works for ASCII alphabetical characters (a-zA-Z). I tried to perform the parsing char by char:

named!(
    parse(&str) -> Self,
    map!(
        take_while1!(nom::AsChar::is_alpha),
        |s| Self { chars: s.into() }
    )
);

But this won't even parse "hello"; instead it fails with an Incomplete(Size(1)) error.

How do you parse UTF-8 alphabetical characters in nom? A snippet from my code:

extern crate nom;

#[derive(PartialEq, Debug, Eq, Clone, Hash, Ord, PartialOrd)]
pub struct Word {
    chars: String,
}

impl From<&str> for Word {
    fn from(s: &str) -> Self {
        Self {
            chars: s.into(),
        }
    }
}

use nom::*;
impl Word {
    named!(
        parse(&str) -> Self,
        map!(
            take_while1!(nom::AsChar::is_alpha),
            |s| Self { chars: s.into() }
        )
    );
}


#[test]
fn parse_word() {
    let words = vec![
        "hello",
        "Hi",
        "aha",
        "Mathematik",
        "mathematical",
        "erfüllen"
    ];
    for word in words {
        assert_eq!(Word::parse(word).unwrap().1, Word::from(word));
    }
}

When I run this test,

cargo test parse_word

I get:

thread panicked at 'called `Result::unwrap()` on an `Err` value: Incomplete(Size(1))', ...

I know that strings are already UTF-8 encoded in Rust (thank heavens, almighty), but it seems that the nom library is not behaving as I would expect. I am using nom 5.1.0.


Solution

  • In this GitHub issue, a fellow contributor quickly wrote a library (nom-unicode) that handles this nicely:

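    use nom_unicode::complete::alphanumeric1;

    impl Word {
        named!(
            parse(&str) -> Self,
            map!(
                // alphanumeric1 accepts Unicode letters and digits. For letters
                // only, nom_unicode::complete::alpha1 should be the counterpart.
                alphanumeric1,
                // Word from the question only implements From<&str>.
                |w| Self::from(w)
            )
        );
    }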
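
To illustrate what a Unicode-aware alphabetic parser has to do, here is a minimal sketch that uses only the standard library, not nom. The helper name split_while1 is hypothetical and is not part of nom or nom-unicode. It returns the remaining input and the matched prefix in the same order as a nom result, and shows why an ASCII-only predicate stops in front of "ü" while char::is_alphabetic consumes the whole word:

```rust
/// Hypothetical helper (not a nom API): split `input` into `(rest, matched)`,
/// where `matched` is the longest non-empty prefix whose chars satisfy `pred`.
/// Returns `None` when the first char does not match or the input is empty.
fn split_while1(input: &str, pred: impl Fn(char) -> bool) -> Option<(&str, &str)> {
    // Byte offset of the first char that fails the predicate,
    // or the input length if every char matches.
    let end = input
        .char_indices()
        .find(|&(_, c)| !pred(c))
        .map(|(i, _)| i)
        .unwrap_or(input.len());

    if end == 0 {
        None
    } else {
        // `end` comes from char_indices or len(), so it is always a char boundary.
        Some((&input[end..], &input[..end]))
    }
}

/// ASCII-only matching: accepts a-z and A-Z, nothing else.
fn ascii_alpha1(input: &str) -> Option<(&str, &str)> {
    split_while1(input, |c| c.is_ascii_alphabetic())
}

/// Unicode-aware matching: accepts any char with the Alphabetic property.
fn unicode_alpha1(input: &str) -> Option<(&str, &str)> {
    split_while1(input, char::is_alphabetic)
}
```

With this sketch, ascii_alpha1("erfüllen") matches only "erf", while unicode_alpha1("erfüllen") matches the whole word. In nom 5 itself, the same predicate can presumably be passed to the function-style nom::bytes::complete::take_while1, which treats the end of input as a successful match rather than returning Incomplete; I have not verified that against 5.1.0 specifically.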