I am writing a parser using PLY. My question is similar to "How to write a regular expression to match a string literal where the escape is a doubling of the quote character?". However, I use double quotes to open and close a string, and a backslash to escape a quote inside it. For example:
"I do not know what \"A\" is"
I define the lexer rule for a normal string as:
t_NORMSTRING = r'"([^"\n]|(\\"))*"$'
and I have another rule for a variable:
def t_VAR(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
The problem is that my lexer doesn't recognize "I do not know what \"A\" is" as a NORMSTRING token. It reports these errors:
Illegal character '"' at 1
Syntax error at 'LexToken(VAR,'do',10,210)'
Please let me know why it is not correct.
Having explored this with a small PLY program, I think your problem comes from the difference between raw and non-raw strings in how the input data reaches the lexer, not from PLY's parsing or lexical matching itself. (As a side note, Python 2 and Python 3 differ slightly in this area of string handling; I have restricted my code to Python 2.)
You only get the error you are seeing if you use a non-raw string, or if you use input instead of raw_input. This is shown by my example code and results below:
Commands:
$ python --version
Python 2.7.5
$ python string.py
import sys
if ".." not in sys.path: sys.path.insert(0,"..")
import ply.lex as lex
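tokens = (
    'NORMSTRING',
    'VAR'
)

def t_NORMSTRING(t):
    r'"([^"\n]|(\\"))*"$'
    # Print the match; nothing is returned, so PLY discards the token.
    print "String: '%s'" % t.value

def t_VAR(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    # No return value either, so identifiers are matched and dropped.

t_ignore = ' \t\r\n'

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)
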
lexer = lex.lex()
data = r'"I do not know what \"A\" is"'
print "Data: '%s'" % data
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok: break
    print tok
Output:
Data: '"I do not know what \"A\" is"'
String: '"I do not know what \"A\" is"'
data = '"I do not know what \"A\" is"'
print "Data: '%s'" % data
lexer.input(data)
while True:
    tok = lexer.token()
    if not tok: break
    print tok
Output:
Data: '"I do not know what "A" is"'
Illegal character '"'
Illegal character '"'
String: '" is"'
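The difference between the two data assignments can also be checked without PLY. The following is a minimal sketch that assumes the standard re module is a fair stand-in for PLY's matcher for this single pattern; matches_from_start is a hypothetical helper name, and the code is written to run on Python 3 as well as Python 2.7.

```python
import re

# The NORMSTRING pattern from the question, compiled with the standard
# library instead of PLY (an assumption made to keep the sketch small).
NORMSTRING = re.compile(r'"([^"\n]|(\\"))*"$')

# Raw literal: the two backslashes are part of the data.
raw_data = r'"I do not know what \"A\" is"'
# Non-raw literal: Python turns each \" into a plain " before the
# lexer ever sees the text, so the backslashes are gone.
plain_data = '"I do not know what \"A\" is"'

def matches_from_start(text):
    """Return True if the pattern matches beginning at the first character."""
    return NORMSTRING.match(text) is not None

for label, text in (("raw", raw_data), ("non-raw", plain_data)):
    print("%-8s length=%d matches=%s" % (label, len(text), matches_from_start(text)))
```

The raw string is two characters longer than the non-raw one, and only the raw string is accepted as a single string literal.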
lexer.input(raw_input("Please type your line: "))
while True:
    tok = lexer.token()
    if not tok: break
    print tok
Output:
Please type your line: "I do not know what \"A\" is"
String: '"I do not know what \"A\" is"'
lexer.input(input("Please type your line: "))
while True:
    tok = lexer.token()
    if not tok: break
    print tok
Output:
Please type your line: "I do not know what \"A\" is"
Illegal character '"'
Illegal character '"'
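The reason input behaves differently is that, in Python 2, it evaluates the typed text as a Python expression, while raw_input returns the characters unchanged. Here is a small sketch of that effect; it uses ast.literal_eval as a stand-in for Python 2's evaluating input (my assumption, chosen so the snippet also runs on Python 3), and the variable names are only illustrative.

```python
import ast

# The characters the user types at the prompt, quotes and backslashes included.
typed = r'"I do not know what \"A\" is"'

# raw_input() would hand these characters to the lexer unchanged.
as_raw_input = typed

# Python 2's input() evaluates the text, so the typed string literal is
# unquoted and unescaped before the lexer sees it.
as_input = ast.literal_eval(typed)

print(as_raw_input)
print(as_input)
```

After evaluation the outer quotes and the backslashes are gone, so the lexer only sees the two quotes around A followed by more text, and the anchored pattern cannot match at either of them.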
As a final note, you probably do not need the end-of-string anchor $ in your regular expression, since it only lets a string literal match when it is the last thing in the input.
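One caveat if you do drop the $: the class [^"\n] also accepts a backslash, so without the anchor forcing the match to run to the end, it can stop at the first escaped quote. The sketch below illustrates this with the standard re module rather than PLY. The tightened pattern is a suggestion of mine, not something from the question, and I have not run it inside a PLY lexer here.

```python
import re

# Pattern from the question with only the trailing `$` removed.
no_anchor = re.compile(r'"([^"\n]|(\\"))*"')

# Hypothetical tightened variant: the character class also excludes the
# backslash, so a backslash can only be consumed together with the
# character that follows it.
tightened = re.compile(r'"([^"\\\n]|\\.)*"')

# A string literal followed by more input.
text = r'"I do not know what \"A\" is" rest'

print(no_anchor.match(text).group(0))   # stops at the first escaped quote
print(tightened.match(text).group(0))   # covers the whole string literal
```

With the backslash excluded from the class, the pattern no longer depends on the anchor to find the real closing quote, so a string can be followed by other tokens.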