Search code examples
parsingrecursionrecursive-descent

Recursive descent parser easy to get explanation


Can someone explain me in simple terms what recursive descent parser is?

I am stuck trying to get it. It is really very vague explained in wikipedia.

Recursive Descent Parser is a kind of top-down parser built as a set of recursive procedures each implementing a production rule of the grammar.

So, do I get it right? The parser is a program which executes commands one by one in a predefined order and each command every time it is executed has the same meaning but it tweaks the output in some way depending on the input that means the tweaking can be different every time the input is changed.

And still I do not get why the word recursion is used in here.


Solution

  • First, a bunch of terminology.

    A parser is software that can check if textual input is syntactically correct according to some grammar. A parser might also transform the textual input into another representation that's easier for other software to use.

    A grammar is a definition of the syntax of a language. A language is a (possibly infinite) set of all the syntactically correct "sentences." A sentence is a sequence of symbols.

    Grammars are described with a set of productions. Productions are rules that tell you how to can substitute sequences of symbols with other sequences of symbols

    Now we can make this more concrete with an example: a simple language of all possible sequences of balanced parentheses. For example, the string "()" would be a member of the language, as would "()()()" and "((()))". Our language will not contain any strings of unbalanced parentheses: "(" and "())" are not part of our language.

    The grammar for this language can be written like this:

    S ::= ""
    S ::= '(' S ')' S
    

    Here, S is a non-terminal symbol, and specifically, the start symbol. Each line represents a production in the grammar. More interesting languages have more non-terminal symbols and many more productions.

    If we want to generate a valid string in our language, we start with the string S and iteratively apply production rules to replace any non-terminal symbols in the string with new sequences. When the string has no more non-terminals, we're done.

    So we start with S, and chose one of the production rules to apply. Suppose we pick the second one, we get ( S ) S. Since we still have non-terminals in the string, we have to keep going. If we replace the first S with the second rule again, we get ( ( S ) S ) S. Now let's choose the first rule, which says we can replace an S with an empty string. (I wrote it as "", but sometimes you'll see people use a Greek epsilon for that.) If we apply this rule to all the remaining Ses in the string, we end up with ( ( ) ), which is a valid sequence in the language.

    OK, but now we want to go the other way. We get a string as input and want to know if it belongs to the language. This is the work of the parser.

    For many (but not all) grammars, there's a particular style of parser implementation you can use called a recursive descent parser. The basic idea is to write functions that correspond to the productions in the grammar. Each function can call other functions to check substrings. They can even call themselves (which is where the "recursion" comes into play).

    Let's re-write our grammar a little differently:

    S ::= '(' P | ""
    P ::= S ')' S
    

    The vertical bar means "or". So S can be replaced by ( P or by an empty string. This example defines the same grammar, but we've re-organized it to more directly represent how the recursive descent parser will work.

    Now suppose we write two functions called ParseS and ParseP. These functions can see the rest of the input string and return true if the next bit of the string matches the corresponding production. In pseudo-code:

    bool ParseS() {
      if next character is '(' {
        skip the `(`
        return ParseP()
      }
      return true;  // handles the empty string
    }
    
    bool ParseP() {
      if ParseS() and the next character is `)` {
        skip the ')'
        return ParseS();
      }
      return false;
    }
    

    Together, these functions form a recursive descent parser for our language.[*] They tell us whether the input string is in the language defined by the grammar, which is the basic definition of a parser. It's recursive because ParseS can call ParseP which can call ParseS.

    [*] Well, almost. I oversimplified. A correct parser would check to make sure that there's no more input before returning the final true. As written, this one will accept any string with a prefix that is a member of the language. This is easy to fix in practice, but it would introduce distracting details into an already too-long answer.