parsing, rust, nom

Parsing custom identifier with nom


I would like to use nom parser combinators to recognize identifiers like these:

"a"
"a1"
"a_b"
"aA"
"aB_3_1"

The first character of the identifier must be a lowercase alphabetic character. Any combination of alphanumeric characters and underscores (so [a-zA-Z0-9_]*) may follow, with two restrictions: two or more consecutive underscores must not occur, and the identifier must not end with an underscore. These cases should therefore be rejected:

"Aa"
"aB_"
"a__a"
"_a"
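
For reference, the rules above can be restated as a plain predicate over a whole string. This is only a sketch using the standard library (no nom); the name is_valid_identifier is made up for illustration, and it assumes ASCII letters and digits, as the class [a-zA-Z0-9_] suggests.

```rust
/// Hypothetical helper (not part of nom): checks the identifier rules
/// against an entire string, assuming ASCII letters and digits.
fn is_valid_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    // Rule 1: the first character must be a lowercase letter.
    match chars.next() {
        Some(c) if c.is_ascii_lowercase() => {}
        _ => return false,
    }
    // Rule 2: the rest may be alphanumerics or underscores, but no two
    // underscores in a row and no underscore at the very end.
    let mut prev_underscore = false;
    for c in chars {
        if c == '_' {
            if prev_underscore {
                return false;
            }
            prev_underscore = true;
        } else if c.is_ascii_alphanumeric() {
            prev_underscore = false;
        } else {
            return false;
        }
    }
    !prev_underscore
}

#[cfg(test)]
mod tests {
    use super::is_valid_identifier;

    #[test]
    fn examples_from_the_question() {
        for s in ["a", "a1", "a_b", "aA", "aB_3_1"] {
            assert!(is_valid_identifier(s), "should accept {:?}", s);
        }
        for s in ["Aa", "aB_", "a__a", "_a"] {
            assert!(!is_valid_identifier(s), "should reject {:?}", s);
        }
    }
}
```

A predicate like this can serve as an oracle when testing a nom-based parser against the same inputs.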

So far I have come up with this solution, but I am unsure whether my approach is correct:

pub fn identifier(s: &str) -> IResult<&str, &str> {
    let (i, _) = verify(anychar, |c: &char| c.is_lowercase())(s)?;
    let (j, _) = alphanumeric0(i)?;
    let (k, _) = recognize(opt(many1(preceded(underscore, alphanumeric1))))(j)?;
    Ok((k,s))
}

Also, I need to wrap this identifier parser in recognize when using it, like this:

pub fn identifier2(s: &str) -> IResult<&str, &str> {
    (recognize(identifier))(s)
}

Solution

  • Here's the variant I came up with. It's mostly the same as yours; I made the following changes:

    • Most importantly, I added all_consuming, which ensures that the entire input matches. The bug in your proposed implementation is that "aBa_" would successfully match the identifier "aBa" and leave the trailing "_" unparsed (returning it as the remaining input).
    • Rewrote it exclusively in terms of parser combinators, rather than chaining steps with the ? operator.
    • Made the underscore matching optional. Nom parsers are greedy in general, so this won't lead to performance degradation.
    • Simplified to only 2 clauses, instead of 3. The parser essentially runs "match any lowercase character, followed by 0 or more runs of an optional _ followed by 1 or more alphanumerics".
    • Changed many1 to many0_count, simply because the latter doesn't allocate a vector.
    • Made the function generic over the error type, allowing users of the function to use any error type they wish.
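    // Imports assume nom's function-style combinators (nom 5 through 7).
    use nom::character::complete::{alphanumeric1, anychar, char};
    use nom::combinator::{all_consuming, opt, recognize, verify};
    use nom::{error::ParseError, multi::many0_count, sequence::{pair, preceded}, IResult};

    pub fn identifier<'a, E: ParseError<&'a str>>(s: &'a str) -> IResult<&'a str, &'a str, E> {
        recognize(all_consuming(pair(
            verify(anychar, |&c| c.is_lowercase()),
            // Zero or more runs of an optional '_' followed by alphanumerics.
            many0_count(preceded(opt(char('_')), alphanumeric1)),
        )))(s)
    }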
    

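    This function as written passes all the test cases you provided. If you specifically don't want the all_consuming, perhaps because this is being used as part of a larger set of parsers, you'll have to check manually that the remaining input doesn't begin with a _ character. The recognized identifier itself can never end in one, because an underscore is only consumed when alphanumerics follow it; for "aB_" the parser would recognize "aB" and leave "_" behind.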
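
    As a rough illustration of that manual check, here is a minimal sketch in plain Rust, deliberately written without nom so that it stands alone. The name identifier_prefix and its Option<(remaining, matched)> return shape are hypothetical stand-ins for a nom parser and its IResult, and ASCII letters and digits are assumed. It consumes the longest prefix of the form "lowercase letter, then runs of an optional _ followed by alphanumerics", and then rejects the match if the leftover input starts with an underscore.

```rust
/// Hypothetical stand-in for a nom parser without all_consuming:
/// returns (remaining input, matched identifier) on success.
fn identifier_prefix(s: &str) -> Option<(&str, &str)> {
    let bytes = s.as_bytes();
    // First character: a lowercase ASCII letter.
    if !bytes.first()?.is_ascii_lowercase() {
        return None;
    }
    let mut end = 1;
    loop {
        // Optionally step over a single underscore...
        let mut i = end;
        if bytes.get(i) == Some(&b'_') {
            i += 1;
        }
        // ...which only counts if at least one alphanumeric follows.
        let run_start = i;
        while bytes.get(i).map_or(false, |b| b.is_ascii_alphanumeric()) {
            i += 1;
        }
        if i == run_start {
            break;
        }
        end = i;
    }
    // Every consumed byte is ASCII, so `end` is a valid char boundary.
    let (matched, rest) = s.split_at(end);
    // Manual check replacing all_consuming: a leftover underscore means
    // the identifier was malformed (as in "aB_" or "a__a").
    if rest.starts_with('_') {
        return None;
    }
    Some((rest, matched))
}

#[cfg(test)]
mod tests {
    use super::identifier_prefix;

    #[test]
    fn leaves_trailing_input_but_rejects_dangling_underscore() {
        assert_eq!(identifier_prefix("a_b+c"), Some(("+c", "a_b")));
        assert_eq!(identifier_prefix("aBa_"), None);
        assert_eq!(identifier_prefix("a__a"), None);
    }
}
```

    With nom itself, the same idea would amount to inspecting the remaining input returned by the parser before accepting the match.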