I want to parse some C code with PLY. What I want to extract is the following:
{ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}
This structure can be nested inside further curly braces:
{SOME, RANDOM, STUFF {ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}, SOME, MORE, RANDOM, STUFF }
Currently I am able to lex the structure I want to extract (ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4), but only if it is the only match:
{SOME, RANDOM, STUFF {ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}, SOME, MORE, RANDOM, STUFF }{Argument1, Argument2, Argument3, Argument4}
This is where my current approach fails, as the lexing output for the above example would be:
ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}, SOME, MORE, RANDOM, STUFF }{Argument1, Argument2, Argument3, Argument4
How can I receive only the following?
ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4
Argument1, Argument2, Argument3, Argument4
Short explanation:
I have a conditional lexer which looks for left curly braces and saves their positions.
For each new left brace I increment a counter.
For each right brace I decrement the counter.
If the counter is zero, I set t.value to everything from the latest left brace to the following right brace.
I guessed that this should work for more than one hit in an example string.
In my opinion, I fail to switch back from the ccode state to the INITIAL state.
Now to my actual code (in this example I left out the commas inside the curly braces to make it a bit simpler for me to program):
import ply.lex as lex
import ply.yacc as yacc

# Declare the state
states = (
    ('ccode', 'exclusive'),
)

tokens = [
    'TEXT',
    'CCODE'
]

# this saves all rbrace positions
# to get the inner curly brace construct you want to use first element
# text lib call should always be the inner curly brace construct
rbrace_positions = []

def t_ANY_TEXT(t):
    r'\w+'
    t.value = str(t.value)
    return t

# Match the first {. Enter ccode state.
def t_ccode(t):
    r'\{'
    t.lexer.code_start = t.lexer.lexpos  # Record the starting position
    print(t.lexer.code_start)
    t.lexer.level = 1                    # Initial brace level
    t.lexer.begin('ccode')               # Enter 'ccode' state

def t_lbrace(t):
    r'\{'
    t.lexer.level += 1

def t_rbrace(t):
    r'\}'
    t.lexer.level -= 1

# Rules for the ccode state
def t_ccode_lbrace(t):
    r'\{'
    t.lexer.current_lbrace = t.lexer.lexpos
    t.lexer.level += 1

def t_ccode_rbrace(t):
    r'\}'
    rbrace_positions.append(t.lexer.lexpos)
    t.lexer.level -= 1

    # If closing brace, return the code fragment
    if t.lexer.level == 0:
        t.value = t.lexer.lexdata[t.lexer.current_lbrace:rbrace_positions[0]-1]
        t.type = "CCODE"
        t.lexer.lineno += t.value.count('\n')
        t.lexer.begin('INITIAL')
        for _ in reversed(rbrace_positions):
            rbrace_positions.pop()
        return t

# C or C++ comment (ignore)
def t_ccode_comment(t):
    r'(/\*(.|\n)*?\*/)|(//.*)'
    pass

# C string
def t_ccode_string(t):
    r'\"([^\\\n]|(\\.))*?\"'

# C character literal
def t_ccode_char(t):
    r'\'([^\\\n]|(\\.))*?\''

# Any sequence of non-whitespace characters (not braces, strings)
def t_ccode_nonspace(t):
    r'[^\s\{\}\'\"]+'

# Ignored characters (whitespace)
t_ccode_ignore = " \t\n"

# For bad characters, we just skip over it
def t_ccode_error(t):
    t.lexer.skip(1)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

data = '''{ I DONT WANT TO RECEIVE THIS
{THIS IS WHAT I WANT TO SEE}
AS WELL AS I DONT WANT TO RECEIVE THIS}
OUTSIDE OF CURLY BRACES
{I WANT TO SEE THIS AGAIN}
'''

lexer.input(data)

for tok in lexer:
    print(tok)
data is just a test string to have an easy example.
In my real C source files there are constructs from which I want to extract Argument1, Argument2, Argument3, Argument4.
Those C files will not compile on their own, but they do not need to, since they are included in other files.
Thank you for all of your input!
Your description is not really clear. Your example seems to indicate that you want to find a braced list which doesn't contain any sublists. So that's the question I'm addressing.
Note that trying to do all this work in the lexer is not generally recommended. Lexers should normally return simple atomic tokens, leaving it to the parser's grammar to do the work of putting the tokens together into a useful structure. But if I've got your use case right, it is possible to do this with the lexer.
Your code decides whether or not to return a CCODE token based on whether the depth counter is 0 when it hits a close brace. But that's apparently not what you want: you don't care how deeply nested the braces are; rather, when a closing brace is encountered, you want to know whether it closes an innermost brace pair or not. You don't need a stack for that, since you only ever need the position of the last open brace read, and you only need that while it is unclosed. So every time you see an open brace, you set the last open brace position, and when you see a closing brace, you check whether the last open brace position is set. If it is, you can return the string since that position and set the last open brace position to None. If it is not set, then just continue the scan.
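To see the idea in isolation, here is a minimal plain-Python sketch of the same scan, without PLY. The function name innermost_braced is made up for illustration, and unlike the lexer rules it makes no attempt to skip braces inside C strings, character literals or comments.

```python
def innermost_braced(text):
    """Return the contents of every brace pair that contains no nested braces."""
    results = []
    last_open = None  # index just past the most recent still-unclosed '{'
    for pos, ch in enumerate(text):
        if ch == '{':
            # A new open brace always replaces the remembered position.
            last_open = pos + 1
        elif ch == '}' and last_open is not None:
            results.append(text[last_open:pos])
            # Any enclosing pair now has a child, so it is not innermost.
            last_open = None
    return results


sample = "{SOME, RANDOM, STUFF {ARGUMENT1, ARGUMENT2}, SOME, MORE}{Argument1, Argument2}"
for item in innermost_braced(sample):
    print(item)
```

A closing brace that arrives while last_open is None belongs to a pair that already contained a sublist, so it is simply skipped.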
Here's a simplified example based on your code:
import ply.lex as lex

# Declare the state
states = (
    ('ccode', 'exclusive'),
)

tokens = [
    'TEXT',
    'CCODE'
]

# Changed from t_ANY_TEXT because otherwise you get all the text inside
# braces as well. Perhaps that's what you wanted but it makes the output less
# clear.
def t_TEXT(t):
    r'\w+'
    t.value = str(t.value)
    return t

# Match the first {. Enter ccode state.
def t_ccode(t):
    r'\{'
    t.lexer.current_open = t.lexer.lexpos  # Record the starting position
    t.lexer.level = 1                      # Initial brace level
    t.lexer.begin('ccode')                 # Enter 'ccode' state

# t_lbrace and t_rbrace deleted because they never match

# Rules for the ccode state
def t_ccode_lbrace(t):
    r'\{'
    t.lexer.current_open = t.lexer.lexpos
    t.lexer.level += 1

def t_ccode_rbrace(t):
    r'\}'
    t.lexer.level -= 1
    if t.lexer.level == 0:
        t.lexer.begin('INITIAL')
    if t.lexer.current_open is not None:
        t.value = t.lexer.lexdata[t.lexer.current_open:t.lexer.lexpos - 1]
        t.type = "CCODE"
        t.lexer.current_open = None
        return t

# C or C++ comment (ignore)
def t_ccode_comment(t):
    r'(/\*(.|\n)*?\*/)|(//.*)'

# C string
def t_ccode_string(t):
    r'\"([^\\\n]|(\\.))*?\"'

# C character literal
def t_ccode_char(t):
    r'\'([^\\\n]|(\\.))*?\''

# Any sequence of non-whitespace characters (not braces, strings)
def t_ccode_nonspace(t):
    r'''[^\s{}'"]+'''  # No need to escape inside a character class

# Ignored characters (whitespace)
t_ccode_ignore = " \t\n"

# For bad characters, we just skip over it
def t_ccode_error(t):
    t.lexer.skip(1)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()

data = '''{ I DONT WANT TO RECEIVE THIS
{THIS IS WHAT I WANT TO SEE}
AS WELL AS I DONT WANT TO RECEIVE THIS}
OUTSIDE OF CURLY BRACES
{I WANT TO SEE THIS AGAIN}
'''

lexer.input(data)

for tok in lexer:
    print(tok)
Sample run:
$ python3 nested_brace.py
LexToken(CCODE,'THIS IS WHAT I WANT TO SEE',1,58)
LexToken(TEXT,'OUTSIDE',1,102)
LexToken(TEXT,'OF',1,110)
LexToken(TEXT,'CURLY',1,113)
LexToken(TEXT,'BRACES',1,119)
LexToken(CCODE,'I WANT TO SEE THIS AGAIN',1,152)
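For comparison, here is a rough sketch of the division of labour recommended above, where the lexer only produces atomic tokens and a separate step assembles them. It uses the standard re module rather than PLY to stay short; tokenize, leaf_groups and TOKEN_RE are hypothetical names, and C strings and comments are not handled.

```python
import re

# Atomic tokens only: a single brace, or a run of non-space, non-brace characters.
TOKEN_RE = re.compile(r'[{}]|[^\s{}]+')


def tokenize(text):
    """Split the input into brace tokens and word tokens."""
    return TOKEN_RE.findall(text)


def leaf_groups(tokens):
    """Collect the words of every brace group that has no nested group."""
    stack = []   # one [words, has_child] entry per currently open brace
    leaves = []
    for tok in tokens:
        if tok == '{':
            if stack:
                stack[-1][1] = True   # the enclosing group now has a child
            stack.append([[], False])
        elif tok == '}':
            if not stack:
                continue              # unbalanced close brace: ignore it
            words, has_child = stack.pop()
            if not has_child:
                leaves.append(' '.join(words))
        elif stack:
            stack[-1][0].append(tok)  # word inside the innermost open group
    return leaves


data = '''{ I DONT WANT TO RECEIVE THIS
{THIS IS WHAT I WANT TO SEE}
AS WELL AS I DONT WANT TO RECEIVE THIS}
OUTSIDE OF CURLY BRACES
{I WANT TO SEE THIS AGAIN}
'''

for group in leaf_groups(tokenize(data)):
    print(group)
```

Here the stack does the work a grammar would do in a real parser. Note that joining the words with a single space normalises whitespace inside each group, which the lexer-only version does not.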