
How to handle 'line-continuation' using parser combinators


I'm trying to write a small parser using the Sprache parser combinator library. The parser should treat a \ at the end of a line as a line continuation, so that the \ and the line break that follows it are parsed as insignificant white space.

Question

How can I create a parser for the value after the = sign, where the value may contain the line-continuation character \? For example:

a = b\e,\
    c,\
    d

This should be parsed as (KeyValuePair (Key, 'a'), (Value, 'b\e, c, d')).

I'm new to this library and to parser combinators in general, so any pointers in the right direction are much appreciated.

What I have tried

Test

public class ConfigurationFileGrammerTest
{
    [Theory]
    [InlineData("x\\\n  y", @"x y")]
    public void ValueIsAnyStringMayContinuedAccrossLinesWithLineContinuation(
        string input, 
        string expectedKey)
    {
        var key = ConfigurationFileGrammer.Value.Parse(input);
        Assert.Equal(expectedKey, key);
    }
}

Production

Attempt one

    public static readonly Parser<string> Value =
        from leading in Parse.WhiteSpace.Many()
        from rest in Parse.AnyChar.Except(Parse.Char('\\')).Many()
            .Or(Parse.String("\\\n")
            .Then(chs => Parse.Return(chs))).Or(Parse.AnyChar.Except(Parse.LineEnd).Many())
        select new string(rest.ToArray()).TrimEnd();

Test output

Xunit.Sdk.EqualException: Assert.Equal() Failure
           ↓ (pos 1)
Expected: x y
Actual:   x\
           ↑ (pos 1)

Attempt two

    public static readonly Parser<string> SingleLineValue =
        from leading in Parse.WhiteSpace.Many()
        from rest in Parse.AnyChar.Many().Where(chs => chs.Count() < 2 || !(string.Join(string.Empty, chs.Reverse().Take(2)).Equals("\\\n")))
        select new string(rest.ToArray()).TrimEnd();

    public static readonly Parser<string> ContinuedValueLines =
        from firsts in ContinuedValueLine.AtLeastOnce()
        from last in SingleLineValue
        select string.Join(" ", firsts) + " " + last;

    public static readonly Parser<string> Value = SingleLineValue.Once().XOr(ContinuedValueLines.Once()).Select(s => string.Join(" ", s));

Test output

Xunit.Sdk.EqualException: Assert.Equal() Failure
           ↓ (pos 1)
Expected: x y
Actual:   x\\n  y
           ↑ (pos 1)

Solution

  • The line continuation must not be included in the output; that is the only issue in the last unit test. When you parse the continuation \\\n, you must drop it from the result and return the empty string instead. I don't know exactly how to do that with Sprache in C#, but maybe something like this:

    Parse.String("\\\n").Then(chs => Parse.Return(""))
    

    I solved the problem using combinatorix, a Python parser combinator library. Its API uses functions instead of chained methods, but the idea is the same.

    Here is the full code with comments:

    # `apply` returns a parser that doesn't consume the input stream. It
    # applies a function (or lambda) to the output of another parser.
    # The following parser removes whitespace from the beginning
    # and the end of what is parsed.
    strip = apply(lambda x: x.strip())
    
    # parse a single equal character
    equal = char('=')
    
    # Parse the key part of a configuration line. Since the API is
    # functional, it reads "inside-out". Note the use of the special
    # `unless(predicate, parser)` parser, which is sometimes missing from
    # parser combinator libraries. It runs `parser` on the input stream
    # only if the `predicate` parser fails, which is similar in spirit to
    # negation in Prolog. Here it parses *anything until an equal sign*,
    # "joins" the characters into a string, and strips any whitespace at
    # the start or end of the string.
    key = strip(join(one_or_more(unless(equal, anything))))
    
    # parse a single newline (line feed) character
    eol = char('\n')
    
    # Returns a parser that returns the empty string. This is a constant
    # parser (i.e. it always outputs the same thing).
    return_empty_space = apply(lambda x: '')
    # This parses a full continuation, i.e. including the spaces that
    # start the new line. It parses *the continuation string, then
    # zero or more spaces* and returns the empty string.
    continuation = return_empty_space(sequence(string('\\\n'), zero_or_more(char(' '))))
    
    # `value` is the parser for the value part. Unless the current char
    # is an `eol` (i.e. \n), it tries to parse a continuation and otherwise
    # parses any single character. It does that at least once, i.e. the
    # value cannot be empty. Then it "joins" all the chars into a single
    # string and strips any whitespace at the start or end of the value.
    value = strip(join(one_or_more(unless(eol, either(continuation, anything)))))
    
    # This drops the element at index 1 and keeps only the
    # elements at 0 and 2 in the result. See below.
    kv_apply = apply(lambda x: (x[0], x[2]))
    
    # This is the final parser for a given kv pair. A kv pair is:
    #
    # - a key part (see key parser)
    # - an equal part (see equal parser)
    # - a value part (see value parser)
    #
    # Those are used to parse the input stream in sequence (one after the
    # other). It will return three values: key, a '=' char and a value.
    # `kv_apply` will only keep the key and value part.
    kv = kv_apply(sequence(key, equal, value))
    
    
    # This is syntactic sugar: it turns the string into a stream of chars
    # and executes the `kv` parser on it.
    parser = lambda string: combinatorix(string, kv)
    
    
    input = 'a = b\\e,\\\n    c,\\\n    d'
    assert parser(input) == ('a', 'b\\e,c,d')
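
For readers who want to try the idea without installing anything, here is a minimal, library-free sketch of the same grammar. It is not combinatorix's implementation: the helper names mirror the ones used above, but their signatures (for example `apply(func, parser)` and `many(parser, at_least)`) and the helpers `make_value`, `make_kv` and `run` are assumptions made for illustration. A parser here is just a function taking `(text, pos)` and returning `(value, new_pos)`, or `None` on failure.

```python
def char(c):
    # Match exactly the character `c`.
    def parse(text, pos):
        return (c, pos + 1) if text.startswith(c, pos) else None
    return parse

def string(s):
    # Match the literal string `s`.
    def parse(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return parse

def anything(text, pos):
    # Match any single character; fails only at end of input.
    return (text[pos], pos + 1) if pos < len(text) else None

def either(*parsers):
    # Ordered choice: the first parser that succeeds wins.
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)
            if result is not None:
                return result
        return None
    return parse

def unless(predicate, parser):
    # Run `parser` only if `predicate` fails at the current position.
    def parse(text, pos):
        return None if predicate(text, pos) is not None else parser(text, pos)
    return parse

def many(parser, at_least=0):
    # Repeat `parser` until it fails or stops consuming input.
    def parse(text, pos):
        values = []
        while True:
            result = parser(text, pos)
            if result is None or result[1] == pos:
                break
            value, pos = result
            values.append(value)
        return (values, pos) if len(values) >= at_least else None
    return parse

def sequence(*parsers):
    # Run the parsers one after the other and collect their values.
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def apply(func, parser):
    # Transform the value produced by `parser` without consuming more input.
    def parse(text, pos):
        result = parser(text, pos)
        return None if result is None else (func(result[0]), result[1])
    return parse

equal = char('=')
eol = char('\n')

def strip_join(chars):
    return ''.join(chars).strip()

def make_value(joiner):
    # A continuation is a backslash-newline plus the spaces that follow it;
    # its parsed value is replaced by `joiner` ('' or ' ').
    continuation = apply(lambda _: joiner,
                         sequence(string('\\\n'), many(char(' '))))
    item = unless(eol, either(continuation, anything))
    return apply(strip_join, many(item, at_least=1))

def make_kv(joiner):
    key = apply(strip_join, many(unless(equal, anything), at_least=1))
    return apply(lambda x: (x[0], x[2]),
                 sequence(key, equal, make_value(joiner)))

def run(parser, text):
    # Parse from position 0; trailing unparsed input is ignored.
    result = parser(text, 0)
    if result is None:
        raise ValueError(f"could not parse {text!r}")
    return result[0]

text = 'a = b\\e,\\\n    c,\\\n    d'
assert run(make_kv(''), text) == ('a', 'b\\e,c,d')
assert run(make_kv(' '), text) == ('a', 'b\\e, c, d')
assert run(make_value(' '), 'x\\\n  y') == 'x y'
```

Replacing the continuation with the empty string reproduces the result asserted above, while replacing it with a single space gives the 'b\e, c, d' and "x y" forms that the question expects. The important design point is the order inside `either`: the continuation parser must be tried before `anything`, otherwise the backslash is consumed as an ordinary character.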