python context-free-grammar bnf pyparsing

Can a BNF handle forward consumption?

Recently I've discovered the python module pyparsing, a wonderful tool for parsing data by writing the grammar, not the parser. I'm new to the idea of context-free grammars, so please correct any false assumptions in this question.

Pyparsing can implement a BNF (Backus–Naur Form) context-free grammar. This grammar can be recursive, but can it have a forward look-ahead? I've been wondering about the answer to this since I stumbled across this question. Let me give you a concrete example. Consider the string:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Let the grammar look something like:

<number> :: __<digit>__
<block>  :: <number>(x) (<number> <number> <number> .... x times)

i.e. read the first number token, save it as x and then consume the next x numbers and group them together. The parsed line should look like:

 [[1, 2], [3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13, 14], [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]

I've written a simple python MWE not using pyparsing so it's clear what I'm trying to do:

A = range(1,31)

B, sub_b = [], []
consume  = 0

for a in A:
    if consume: 
        sub_b.append(a)
        consume -= 1
    else:
        if sub_b: B.append(sub_b)
        sub_b = [a,]
        consume = a
B.append(sub_b)
print B

Two (related) questions: Can this be done with a BNF context-free grammar? If yes/no how can I do it with pyparsing?

Solution

There is no such thing as a parametrized size in a context-free grammar-- or in a regular grammar, for that matter. Numerical parameters like your consume are not part of the CFG model, and it can be proved that getting the effect in a different way is impossible. You can write a production for any fixed length, though, so you can write productions for a block of length 1, length 2, length 3, etc.:

<block3> :: <number> <number> <number>

or, similarly, matching the length as a prefix or even as a post-fix:

<block3a> :: 3 <number> <number> <number>
<block3b> :: <number> <number> <number> 3

etc. So to do what you want, you just need a grammar containing rules of the kind, for all N you might need.

A given CFG will only include a finite number of these productions. It is mathematically impossible to write a CFG (in BNF or any other form) that can handle unlimited parametrized sizes, whether as a prefix or as a postfix. In practice, you might be able to update your CFG on the fly with new productions as needed. E.g., read a number N and create a rule blockN for your grammar if it's not already present. But there's no single CFG that can capture unbounded parametrized sizes.

Edit, since you also asked about context-sensitive grammars: That still won't do it. The problem is the use of integer arithmetic, not the class of the grammar. Any formal language on the Chomsky hierarchy is defined in terms of a finite number of symbols (tokens), and since are infinitely many integers, they cannot be assigned distinct meanings (note that your parsing procedure relies on integer arithmetic).

If you were to pre-process the length into a sequence of as many stars (* * * 4 7 10), the CFG parser is trivial:

<group> :: * <group> <number>
<group> :: * <number>

This is just the so-called a^n b^n language. You could also have symbols that mean "ten", etc. But without pre-processing, the only solution (and what your procedure or a Turing machine does in practice) is to interpret numeral notation in your grammar. E.g., parse "21" as ten ten one. I doubt this can be done in a CFG (the problem is handling arbitrarily long numerals without separate rules for millions, billions etc.), but I'm not really sure. Either way it is only interesting as an academic exercise, since using real integers is so much easier. I'm sure people have studied the properties of formal languages with integers, but I can't say anything about that.