I'm trying to write a parser for a filetype that utilizes keyword pairs (separated by a space) and am struggling with the correct way to do this. Some examples of tokens might be:
angle spring
angle dampen
angle collision
There are also block definitions and tokens that end that block, for example:
dynamics
angle spring 1.0
angle dampen 0.0
angle collision 0.0
some 1 2 3
more ['stuff' 'here']
tokens "values can be strings, paths, etc"
end dynamics
Newlines seem to be significant, and I've been using them to determine whether I'm looking at a keyword or just a regular old string (keywords should be the first token on each line). Am I approaching this the right way? Should I instead just tokenize everything and define pairs more rigorously during the yacc stage?
Thanks for your time!
The problem is that you are trying to treat what is logically a single token as multiple tokens. If a keyword contains spaces, the spaces are part of the keyword token.
If you define your keyword tokens including the spaces, you won't ever need to handle them in the parser. This means you should separate keyword matching from normal identifier matching.
For example:
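from ply.lex import TOKEN

KEYWORDS = [
    r'some', r'keyword',
    r'keyword with token',
    r'other keyword',
]

# Try longer keywords first so 'keyword with token' is not cut short at 'keyword'.
keyword = '|'.join(
    kw.replace(' ', r'\s+')
    for kw in sorted(KEYWORDS, key=len, reverse=True)
)

@TOKEN(keyword)
def t_KEYWORD(t):
    # Remove the spaces and normalize the case; a PLY rule must return the token itself.
    t.value = ''.join(x for x in t.value if not x.isspace()).upper()
    return t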
Note the @TOKEN(keyword) line: you can set the docstring of a function dynamically using the TOKEN decorator. This allows complex regexes to be used for defining tokens, even when building them requires an expression rather than a simple string literal.
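To see what such a joined pattern actually matches, here is a minimal sketch that uses only the standard re module rather than PLY. The keyword list is taken from the question's examples, and the names build_keyword_regex, normalize and leading_keyword are made up for illustration. It assumes that newlines are significant in this format, so the words of a keyword are joined with [ \t]+ instead of \s+ (which would also match across a line break), and it joins the words of the result with an underscore to get names like END_DYNAMICS.

```python
import re

# Keywords from the question's examples; the bare 'angle' entry is only there
# to show why ordering matters.
KEYWORDS = ['angle spring', 'angle dampen', 'angle collision',
            'end dynamics', 'dynamics', 'angle']

def build_keyword_regex(keywords):
    # Longest first: alternation takes the first branch that matches,
    # so 'angle spring' must come before 'angle'.
    ordered = sorted(keywords, key=len, reverse=True)
    parts = [r'[ \t]+'.join(re.escape(word) for word in kw.split())
             for kw in ordered]
    return r'\b(?:' + '|'.join(parts) + r')\b'

KEYWORD_RE = re.compile(build_keyword_regex(KEYWORDS))

def normalize(text):
    # 'end   dynamics' -> 'END_DYNAMICS'
    return '_'.join(text.split()).upper()

def leading_keyword(line):
    # Keywords are expected to be the first thing on a line.
    match = KEYWORD_RE.match(line.lstrip())
    return normalize(match.group()) if match else None
```

With this, leading_keyword('angle spring 1.0') gives 'ANGLE_SPRING', and a line such as 'some 1 2 3' gives None because 'some' is not in the list.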
The alternative is to treat the space-separated keywords as multiple keywords. So you keep the usual definition for identifiers and keywords and modify your grammar to use multiple keywords instead of one.
For example you'd have a grammar rule like:
def p_dynamics(p):
    'dynamics : DYNAMICS BLOCK END DYNAMICS'
instead of:
def p_dynamics(p):
    'dynamics : DYNAMICS BLOCK END_DYNAMICS'
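The multi-token variant can be sketched without PLY as follows. This is a hand-written stand-in for the yacc rule, not PLY's API: tokenize, parse_dynamics and SINGLE_WORD_KEYWORDS are hypothetical names, and the tokenizer simply splits on whitespace, so it assumes values contain no spaces (quoted strings would need a real lexer).

```python
# Every word is its own token; 'end' and 'dynamics' are two separate keywords.
SINGLE_WORD_KEYWORDS = {'dynamics', 'end', 'angle', 'spring', 'dampen', 'collision'}

def tokenize(text):
    """Yield (type, value) pairs, one per word, plus a NEWLINE per line."""
    for line in text.splitlines():
        for word in line.split():
            kind = word.upper() if word in SINGLE_WORD_KEYWORDS else 'VALUE'
            yield (kind, word)
        yield ('NEWLINE', '\n')

def parse_dynamics(tokens):
    """dynamics : DYNAMICS NEWLINE statements END DYNAMICS NEWLINE"""
    tokens = list(tokens)
    kinds = [kind for kind, _ in tokens]
    if kinds[:2] != ['DYNAMICS', 'NEWLINE']:
        raise SyntaxError('expected "dynamics" at the start of the block')
    # The pairing happens here, in the grammar, not in the lexer.
    if kinds[-3:] != ['END', 'DYNAMICS', 'NEWLINE']:
        raise SyntaxError('expected "end dynamics" at the end of the block')
    statements, current = [], []
    for kind, value in tokens[2:-3]:
        if kind == 'NEWLINE':
            if current:
                statements.append(current)
            current = []
        else:
            current.append(value)
    return statements
```

The lexer stays trivial here, at the cost of the parser having to know which keyword sequences are legal.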
Depending on the constraints you have, one solution could be easier to implement than the other.