
How to parse slightly ambiguous data using nom?


In RFC1738, the BNF for domainlabel is the following:

domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit

That is, it's either an alphadigit, or it's a string where the first/last characters have to be an alphadigit but the intermediate characters can be an alphadigit or a dash.

How do I implement this with nom? Ignoring the single character scenario to simplify the case, my final attempt is:

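use nom::bytes::complete::{take_while, take_while_m_n};
use nom::character::is_alphanumeric;
use nom::sequence::tuple;
use nom::IResult;

fn domain_label(s: &[u8]) -> IResult<&[u8], (&[u8], &[u8], &[u8])> {
    // Exactly one leading alphadigit.
    let left = take_while_m_n(1, 1, is_alphanumeric);
    // Greedy: takes every following alphadigit or dash.
    let middle = take_while(|c| is_alphanumeric(c) || c == b'-');
    // Exactly one trailing alphadigit.
    let right = take_while_m_n(1, 1, is_alphanumeric);
    let whole = tuple((left, middle, right));
    whole(s)
}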

The problem is that middle is greedy: it consumes every remaining alphadigit, including the last one, so right fails because there is nothing left for it to consume.

println!("{:?}", domain_label(b"abcde"));
Err(Error(([], TakeWhileMN)))

A parser should be able to try the other possible consumption paths, but how do I do this with nom?


Solution

  • domainlabel = alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit

    In other words, a domain label is a series of alphanumeric sequences separated by one or more - characters. Here is one way to do it:

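    // Written against the nom 5 API; later nom releases rename
    // separated_nonempty_list to separated_list1.
    use nom::bytes::complete::{tag, take_while1};
    use nom::character::is_alphanumeric;
    use nom::combinator::recognize;
    use nom::multi::{many1, separated_nonempty_list};
    use nom::IResult;

    fn domain_label(input: &[u8]) -> IResult<&[u8], &[u8]> {
        // One or more alphadigits.
        let alphadigits = take_while1(is_alphanumeric);
        // One or more dashes between two alphadigit runs.
        let delimiter = many1(tag(b"-"));
        // At least one run is required, so empty input is rejected
        // (plain separated_list would also accept zero runs).
        let parser = separated_nonempty_list(delimiter, alphadigits);

        // Return the consumed input instead of the Vec of runs.
        recognize(parser)(input)
    }

    fn main() {
        let (_, res) = domain_label(b"abcde").unwrap();
        assert_eq!(res, b"abcde");
        // A trailing dash is not part of the label.
        let (_, res) = domain_label(b"abcde-123-xyz-").unwrap();
        assert_eq!(res, b"abcde-123-xyz");
        let (_, res) = domain_label(b"rust-lang--1---37---0.org").unwrap();
        assert_eq!(res, b"rust-lang--1---37---0");
    }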
    

    Note that you don't need the individual parts of a successful parse. The result is just the longest prefix of the input that conforms to the domainlabel BNF. That's where the recognize combinator comes in: it returns the slice of input that the inner parser consumed.
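
To make the "longest conforming prefix" behaviour concrete, here is a minimal plain-Rust sketch that uses no nom at all. The helper name split_domain_label is hypothetical (it is not part of nom), and it assumes ASCII input and that a trailing dash should be left unconsumed, as in the asserts above. It takes the greedy run of alphadigits and dashes and then gives back any trailing dashes, which is the "backtracking" the three-part tuple parser was missing:

```rust
/// Hypothetical helper, not part of nom.
/// Splits `input` into (label, rest), where `label` is the longest prefix
/// matching: alphadigit | alphadigit *[ alphadigit | "-" ] alphadigit.
/// Returns None when `input` does not start with an alphadigit.
fn split_domain_label(input: &[u8]) -> Option<(&[u8], &[u8])> {
    // A label must start with an alphadigit.
    if !input.first()?.is_ascii_alphanumeric() {
        return None;
    }
    // Greedily count every leading alphadigit or dash ...
    let run = input
        .iter()
        .take_while(|&&c| c.is_ascii_alphanumeric() || c == b'-')
        .count();
    // ... then give back trailing dashes so the label ends on an alphadigit.
    let end = input[..run]
        .iter()
        .rposition(|c| c.is_ascii_alphanumeric())
        .map(|i| i + 1)?;
    Some(input.split_at(end))
}

fn main() {
    assert_eq!(
        split_domain_label(b"abcde"),
        Some((&b"abcde"[..], &b""[..]))
    );
    // The trailing dash stays in the remaining input.
    assert_eq!(
        split_domain_label(b"abcde-123-xyz-"),
        Some((&b"abcde-123-xyz"[..], &b"-"[..]))
    );
    // A leading dash or empty input is not a domain label.
    assert_eq!(split_domain_label(b"-abc"), None);
    assert_eq!(split_domain_label(b""), None);
}
```

The nom version is still the better choice when the label is one piece of a larger grammar, since it composes with other combinators; this sketch is only meant to show what result to expect.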