Search code examples
recursionbnfformal-languages

Recursion definitions in BNF


I've recentrly started studying programming languages, and I'm trying to get better at understanding recursive definitions in BNF.

For example, if we consider identifier as anything that consists of letters and digits, but always starts with a letter, the definition for such identifier would be

<identifier> ::= <letter> | <identifier> <character>
<character>  ::= <digit> | <letter>

But it gets interesting when we consider a definition of number. Should it be this

<number> ::= <digit> | <number> <digit>

or this?

<number> ::= <digit> | <digit> <number>

– syntactically they both seem to be correct, but my guts say that their semantics would differ – one number would be 0123456789, and another – 9876543210. Am I wrong and both number definitions are valid?


Solution

  • Both definitions of a number are correct, if you allow numbers to start with leading zeros. Since we often don't allow that, a more common definition is:

    <non-zero-number> ::= <non-zero-digit> | <non-zero-number> <digit>
    <number> ::= 0 | <non-zero-number>
    

    Both this and the <identifier> definition are left-recursive because they describe strings whose leftmost character is special (because its possible values are limited). If we were trying to describe a set of strings whose last (rightmost) characters are restricted, we would likely use right-recursion.

    Since we read left to right, there are a lot more syntaxes whose leftmost elements are special than syntaxes whose rightmost elements are special. (And parsing algorithms, which also work from left to right, tend to be optimised for this common case.)

    (I am completely aware that not all people read from left to right. Many of us read from right-to-left, and it is not my intention to suggest that one or the other is better. It would be more culturally neutral to talk about reading or parsing "from beginning to end", and to distinguish between "leading recursion" and "trailing recursion". Indeed, in a different CS context which has less to do with languages, it is common to talk about "tail recursion"; that is, recursion which is done at the very end of a function. But the mathematicians responsible for modern parsing theory were writing in English and the left-to-right bias crept in to the terminology.)

    Since both left- and right-recursion are equally adept at describing sets of strings whose character positions are undistinguished, we could use either to describe numbers, as in your question. We could even use a right-recursive grammar to describe left-distinguished strings, by separating off the first position and gluing it back on later:

    <suffix> ::=  <empty> | <letter> <suffix> | <digit> <suffix>
    <identifier> ::= <letter> <suffix>
    

    But that feels unnatural, since an identifier really is just a single thing.