python, parsing, compiler-construction

Simple parser, but not a calculator


I am trying to write a very simple parser. I have read similar questions here on SO and elsewhere on the Internet, but everything I could find was limited to "arithmetic-like" things.

I have a very simple DSL, for example:

ELEMENT TYPE<TYPE> elemName {
    TYPE<TYPE> memberName;
}

Where the <TYPE> part is optional and valid only for some types.

Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:

  1. How do I look for tokens that are longer than 1 char?
  2. How do I break up the text in the different parts? For example, after a TYPE there can be whitespace, a <, or whitespace followed by a <. How do I handle that?

Solution

    Short answer

    All your questions boil down to the fact that you are not tokenizing your string before parsing it.

    Long answer

    The process of parsing is actually split into two distinct parts: lexing and parsing.

    Lexing

    What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking "How do I break up the text in the different parts?"

    You can do it yourself by checking your string against a list of regexps with re, or you can use a well-known library such as PLY. And if you are using Python 3, I will be biased toward ComPyl, a lexing-parsing library that I wrote.
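
    If you prefer to do it yourself with re, here is a minimal sketch of what that could look like. The (pattern, name) rule format mirrors the ComPyl example below; the tokenize function and the SEMICOLON rule are my own additions, the latter to cover the ; in your DSL:

    import re
    
    # Each rule is a (pattern, token_name) pair; a name of None means
    # "match and discard", which is how whitespace will be skipped.
    RULES = [
        (r'\s+', None),
        (r'< *\w+ *>', 'TYPE'),   # <TYPE>, allowing inner whitespace
        (r'\w+', 'ID'),
        (r'\{', 'L_BRACKET'),
        (r'\}', 'R_BRACKET'),
        (r';', 'SEMICOLON'),      # added to cover the ; in the DSL
    ]
    
    def tokenize(text):
        compiled = [(re.compile(pattern), name) for pattern, name in RULES]
        pos = 0
        while pos < len(text):
            for regex, name in compiled:
                match = regex.match(text, pos)
                if match:
                    if name is not None:   # discard ignored tokens
                        yield (name, match.group())
                    pos = match.end()
                    break
            else:   # no rule matched at this position
                raise SyntaxError('Unexpected character %r at position %d'
                                  % (text[pos], pos))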

    Proceeding with ComPyl, the syntax you are looking for would be the following.

    from compyl.lexer import Lexer
    
    rules = [
        (r'\s+', None),
        (r'\w+', 'ID'),
        (r'< *\w+ *>', 'TYPE'), # Will match your <TYPE> token, allowing inner whitespace
        (r'{', 'L_BRACKET'),
        (r'}', 'R_BRACKET'),
    ]
    
    lexer = Lexer(rules=rules, line_rule='\n')
    # See the ComPyl docs to figure out how to proceed from here
    

    Notice that the first rule, (r'\s+', None), is what solves your issue about whitespace: it tells the lexer to match any run of whitespace characters and ignore it. And if you do not want to use a lexing tool, you can simply add a similar rule to your own re-based implementation, as the sketch above already does.
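
    As a quick sanity check, feeding your example into the hand-rolled tokenize sketch from above (so the token names are the ones chosen there) yields a flat stream with all the whitespace already gone:

    source = '''ELEMENT TYPE<TYPE> elemName {
        TYPE<TYPE> memberName;
    }'''
    
    print(list(tokenize(source)))
    # [('ID', 'ELEMENT'), ('ID', 'TYPE'), ('TYPE', '<TYPE>'), ('ID', 'elemName'),
    #  ('L_BRACKET', '{'), ('ID', 'TYPE'), ('TYPE', '<TYPE>'), ('ID', 'memberName'),
    #  ('SEMICOLON', ';'), ('R_BRACKET', '}')]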

    Parsing

    You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that plenty of tools can do it for you: the PLY and ComPyl libraries offer LR(1) parsers, which are more powerful than LL(1) parsers but also harder to write by hand.

    Simply notice that, now that you know how to tokenize your string, your other question, "How do I look for tokens that are longer than 1 char?", is already solved: you are no longer parsing a stream of characters but a stream of tokens that encapsulate the matched words.
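
    That said, to make the hand-written LL(1) idea concrete, here is a minimal recursive-descent sketch over that token stream. It assumes the tokenize function from the sketch above and a grammar guessed from your single example, so treat it as illustrative; the point is that every decision peeks at exactly one token ahead, which is what LL(1) means:

    class Parser:
        def __init__(self, tokens):
            self.tokens = list(tokens)
            self.pos = 0
    
        def peek(self):
            # Kind of the next token, or None at end of input -- the single
            # token of lookahead that makes this LL(1)
            return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None
    
        def expect(self, kind):
            # Consume the next token, failing if it is not of the expected kind
            if self.peek() != kind:
                raise SyntaxError('Expected %s, got %s' % (kind, self.peek()))
            value = self.tokens[self.pos][1]
            self.pos += 1
            return value
    
        def parse_type(self):
            # type := ID [TYPE]   -- the <TYPE> part is optional
            base = self.expect('ID')
            param = self.expect('TYPE') if self.peek() == 'TYPE' else None
            return (base, param)
    
        def parse_element(self):
            # element := 'ELEMENT' type ID '{' member* '}'
            # member  := type ID ';'
            keyword = self.expect('ID')
            if keyword != 'ELEMENT':
                raise SyntaxError('Expected ELEMENT, got %s' % keyword)
            elem_type = self.parse_type()
            elem_name = self.expect('ID')
            self.expect('L_BRACKET')
            members = []
            while self.peek() != 'R_BRACKET':
                member_type = self.parse_type()
                member_name = self.expect('ID')
                self.expect('SEMICOLON')
                members.append((member_type, member_name))
            self.expect('R_BRACKET')
            return (elem_type, elem_name, members)

    Calling Parser(tokenize(source)).parse_element() on the example above returns (('TYPE', '<TYPE>'), 'elemName', [(('TYPE', '<TYPE>'), 'memberName')]).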