I am trying to parse sequences of alphabetical characters, including German umlauts (ä ö ü) and other alphabetical characters from the UTF-8 charset. This is the parser I tried first:
named!(
    parse(&'a str) -> Self,
    map!(
        alpha1,
        |s| Self { chars: s.into() }
    )
);
But it only works for ASCII alphabetical characters (a-zA-Z).
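To make the ASCII limitation concrete, here is a minimal sketch that uses only the standard library (no nom). The helper name `leading_prefix` is hypothetical and exists only for illustration; the point is the contrast between `char::is_ascii_alphabetic` and the Unicode-aware `char::is_alphabetic`.

```rust
/// Hypothetical helper: returns the longest prefix of `input`
/// whose characters all satisfy `pred`.
fn leading_prefix(input: &str, pred: impl Fn(char) -> bool) -> &str {
    // Byte offset of the first char that fails the predicate,
    // or the full length if every char passes.
    let end = input
        .char_indices()
        .find(|&(_, c)| !pred(c))
        .map(|(i, _)| i)
        .unwrap_or(input.len());
    &input[..end]
}

fn main() {
    let word = "erfüllen";
    // ASCII-only check stops at the umlaut.
    let ascii_only = leading_prefix(word, |c| c.is_ascii_alphabetic());
    // Unicode-aware check accepts the whole word.
    let unicode = leading_prefix(word, char::is_alphabetic);
    println!("{:?} {:?}", ascii_only, unicode);
}
```

An ASCII-only predicate cuts "erfüllen" off at "erf", which matches the behavior described above; a predicate built on `char::is_alphabetic` keeps the whole word.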
I tried to perform the parsing char by char:
named!(
    parse(&str) -> Self,
    map!(
        take_while1!(nom::AsChar::is_alpha),
        |s| Self { chars: s.into() }
    )
);
But this won't even parse "hello"; instead it results in an Incomplete(Size(1)) error.
How do you parse UTF-8 alphabetical characters in nom? A snippet from my code:
extern crate nom;

#[derive(PartialEq, Debug, Eq, Clone, Hash, Ord, PartialOrd)]
pub struct Word {
    chars: String,
}

impl From<&str> for Word {
    fn from(s: &str) -> Self {
        Self {
            chars: s.into(),
        }
    }
}

use nom::*;

impl Word {
    named!(
        parse(&str) -> Self,
        map!(
            take_while1!(nom::AsChar::is_alpha),
            |s| Self { chars: s.into() }
        )
    );
}

#[test]
fn parse_word() {
    let words = vec![
        "hello",
        "Hi",
        "aha",
        "Mathematik",
        "mathematical",
        "erfüllen"
    ];
    for word in words {
        assert_eq!(Word::parse(word).unwrap().1, Word::from(word));
    }
}
When I run this test,
cargo test parse_word
I get:
thread panicked at 'called `Result::unwrap()` on an `Err` value: Incomplete(Size(1))', ...
I know that chars are already Unicode-aware in Rust, with strings being UTF-8 encoded (thank heavens, almighty), but it seems that the nom library is not behaving as I would expect. I am using nom 5.1.0.
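The Incomplete(Size(1)) result is consistent with a streaming parser. As far as I understand, the nom 5 macros default to streaming behavior, so when every character of the input matches, the parser cannot tell whether more alphabetical characters would follow and asks for more data. The sketch below models that distinction with the standard library only. `Outcome`, `take_alpha1_streaming`, and `take_alpha1_complete` are hypothetical names for illustration, not nom's API.

```rust
/// Hypothetical result type modelling the three outcomes discussed above.
#[derive(Debug, PartialEq)]
enum Outcome<'a> {
    /// Parsed a word; carries (remaining input, matched word).
    Done(&'a str, &'a str),
    /// The input ended before the parser could decide where the word stops.
    Incomplete,
    /// The input does not start with an alphabetic character.
    Error,
}

/// Byte offset of the first non-alphabetic char, if there is one.
fn first_non_alpha(input: &str) -> Option<usize> {
    input
        .char_indices()
        .find(|&(_, c)| !c.is_alphabetic())
        .map(|(i, _)| i)
}

/// Streaming model: running out of input means "need more data".
fn take_alpha1_streaming(input: &str) -> Outcome<'_> {
    match first_non_alpha(input) {
        Some(0) => Outcome::Error,
        Some(i) => Outcome::Done(&input[i..], &input[..i]),
        None => Outcome::Incomplete,
    }
}

/// Complete model: the end of the input also ends the word.
fn take_alpha1_complete(input: &str) -> Outcome<'_> {
    match first_non_alpha(input) {
        Some(0) => Outcome::Error,
        Some(i) => Outcome::Done(&input[i..], &input[..i]),
        None if input.is_empty() => Outcome::Error,
        None => Outcome::Done("", input),
    }
}

fn main() {
    println!("{:?}", take_alpha1_streaming("hello"));
    println!("{:?}", take_alpha1_streaming("hello "));
    println!("{:?}", take_alpha1_complete("erfüllen"));
}
```

In nom itself, the non-streaming counterparts live in the `complete` modules, for example `nom::bytes::complete::take_while1`; combining one of those with a Unicode-aware predicate such as `char::is_alphabetic` is one possible way to get the behavior the test expects.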
On this GitHub issue, a fellow contributor quickly whipped up a library (nom-unicode) to handle this nicely:
use nom_unicode::complete::{alphanumeric1};

impl Word {
    named!(
        parse(&'a str) -> Self,
        map!(
            alphanumeric1,
            |w| Self::new(w)
        )
    );
}