Search code examples
parsinglanguage-agnosticglr

Implementing "*?" (lazy "*") regexp pattern in combinatorial GLR parsers


I have implemented combinatorial GLR parsers. Among them there are:

  • char(·) parser which consumes specified character or range of characters.
  • many(·) combinator which repeats specified parser from zero to infinite times.

Example: "char('a').many()" will match a string with any number of "a"-s.

But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.

I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?

Edit:

May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.

Parser = f(Step) -> Collection of TreeNodes of Steps.

So when I parse input, I:

  • Compose simple predefined Parser functions to get complex Parser function representing required grammar.

  • Form initial Step from the input.

  • Give the initial Step to the complex Parser function.

  • Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).

  • Gather information from Steps which were left.


Solution

  • Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?

    Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?

    This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.

    One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.