I am trying to write a very simple parser. I read similar questions here on SO and on the Internet, but all I could find was limited to "arithmetic like" things.
I have a very simple DSL, for example:
ELEMENT TYPE<TYPE> elemName {
TYPE<TYPE> memberName;
}
Where the <TYPE>
part is optional and valid only for some types.
Following what I read, I tried to write a recursive descent parser in Python, but there are a few things that I can't seem to understand:
<
or a whitespace followed by a <
. How do I address that?All your questions boil down to the fact that you are not tokenizing your string before parsing it.
The process of parsing is actually split in two distinct parts: lexing and parsing.
What seems to be missing in the way you think about parsing is called tokenizing or lexing. It is the process of converting a string into a stream of tokens, i.e. words. That is what you are looking for when asking How do I break up the text in the different parts?
You can do it by yourself by checking your string against a list of regexp using re
, or you can use some well-known librairy such as PLY. Although if you are using Python3, I will be biased toward a lexing-parsing librairy that I wrote, which is ComPyl.
So proceeding with ComPyl, the syntax you are looking for seems to be the following.
from compyl.lexer import Lexer
rules = [
(r'\s+', None),
(r'\w+', 'ID'),
(r'< *\w+ *>', 'TYPE'), # Will match your <TYPE> token with inner whitespaces
(r'{', 'L_BRACKET'),
(r'}', 'R_BRACKET'),
]
lexer = Lexer(rules=rules, line_rule='\n')
# See ComPyl doc to figure how to proceed from here
Notice that the first rule (r'\s+', None)
, is actually what solves your issue about whitespace. It basically tells the lexer to match any whitespace character and to ignore them. Of course if you do not want to use a lexing tool, you can simply add a similar rule in your own re
implementation.
You seem to want to write your own LL(1) parser, so I will be brief on that part. Just know that there exist a lot of tools that can do that for you (PLY and ComPyl librairies offer LR(1) parsers which are more powerful but harder to hand-write, see the difference between LL(1) and LR(1) here).
Simply notice that now that you know how to tokenize your string, the issue of How do I look for tokens that are longer than 1 char? has been solved. You are now parsing, not a stream of characters, but a stream of tokens that encapsulate the matched words.