Tags: python-2.7, lex, lexer, ply

Use PLY to match a normal string


I am writing a parser using PLY. My question is similar to this one: How to write a regular expression to match a string literal where the escape is a doubling of the quote character? However, I use double quotes to open and close a string, with a backslash escaping any quote inside it. For example:

"I do not know what \"A\" is"

I define the normal string lexer as:

t_NORMSTRING = r'"([^"\n]|(\\"))*"$'

and I define another token rule for variables:

def t_VAR(t):
   r'[a-zA-Z_][a-zA-Z_0-9]*'

The problem is that my lexer doesn't recognize "I do not know what \"A\" is" as a NORMSTRING token. It reports the errors:

Illegal character '"' at 1
Syntax error at 'LexToken(VAR,'do',10,210)'

Please let me know why it is not correct.


Solution

  • Having explored this issue with a little PLY program, I think your problem comes from the difference between raw and non-raw handling of the input string, not from PLY's lexing and pattern matching itself. (As a side note, there are minor differences between Python 2 and Python 3 in this area of string handling; I have restricted my code to Python 2.)

    You only get the error you are seeing if you use a non-raw string literal, or if you read the line with input instead of raw_input. This is shown by my example code and results below:

    Commands:

    $ python --version
    Python 2.7.5
    $ python string.py

    Contents of string.py, shown in sections with the output of each section below it:
    
    import sys
    
    if ".." not in sys.path: sys.path.insert(0,"..")
    import ply.lex as lex
    tokens = (
        'NORMSTRING',
        'VAR'
    )
    
    def t_NORMSTRING(t):
        r'"([^"\n]|(\\"))*"$'
        # Print the matched string. Note that the token is not returned,
        # so it will not appear in the token loop below.
        print "String: '%s'" % t.value

    def t_VAR(t):
        r'[a-zA-Z_][a-zA-Z_0-9]*'
        # Matched but not returned, so VAR tokens are silently discarded.
    
    t_ignore = ' \t\r\n'
    
    def t_error(t):
        print "Illegal character '%s'" % t.value[0]
        t.lexer.skip(1)
    
    lexer = lex.lex()
    
    data = r'"I do not know what \"A\" is"'
    
    print "Data: '%s'" % data
    
    lexer.input(data)
    
    while True:
       tok = lexer.token()
       if not tok: break
       print tok
    

    Output:

    Data: '"I do not know what \"A\" is"'
    String: '"I do not know what \"A\" is"'
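
    In this raw-literal case the backslash-quote pairs survive in the data, and the (\\") alternative in the pattern can consume them (the $ anchor forces the match to extend to the end of the data), so the whole line matches. As a standalone illustration with plain re, outside PLY, checking the same pattern directly shows the full match:

    import re

    # The t_NORMSTRING pattern, tried directly against the raw data.
    pattern = re.compile(r'"([^"\n]|(\\"))*"$')
    data = r'"I do not know what \"A\" is"'
    # Prints the whole line: "I do not know what \"A\" is"
    print pattern.match(data).group(0)

    With a non-raw literal for the same data, the picture changes: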
    
    data = '"I do not know what \"A\" is"'
    
    print "Data: '%s'" % data
    
    lexer.input(data)
    
    while True:
       tok = lexer.token()
       if not tok: break
       print tok
    

    Output:

    Data: '"I do not know what "A" is"'
    Illegal character '"'
    Illegal character '"'
    String: '" is"'
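
    The mangled output follows from what the non-raw literal actually contains: the \" escapes collapse to bare quotes before the lexer ever sees the data, and the $ anchor then only lets a quoted piece that reaches the end of the input match. A standalone check with plain re (illustration only, not part of string.py):

    import re

    cooked = '"I do not know what \"A\" is"'   # the \" pairs collapse to bare quotes
    print repr(cooked)                # '"I do not know what "A" is"'

    pattern = re.compile(r'"([^"\n]|(\\"))*"$')
    print pattern.match(cooked)       # None: no match from the start of the data
    print pattern.search(cooked).group(0)   # " is", the only quoted piece reaching the end

    The remaining two runs read the line from the terminal instead of from a literal: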
    
    lexer.input(raw_input("Please type your line: "))
    
    while True:
       tok = lexer.token()
       if not tok: break
       print tok
    

    Output:

    Please type your line: "I do not know what \"A\" is"
    String: '"I do not know what \"A\" is"'
    
    lexer.input(input("Please type your line: "))
    
    while True:
       tok = lexer.token()
       if not tok: break
       print tok
    

    Output:

    Please type your line: "I do not know what \"A\" is"
    Illegal character '"'
    Illegal character '"'
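
    This last failure is expected: in Python 2, input() is equivalent to eval(raw_input()), so the typed text is evaluated as a Python expression and both the outer quotes and the backslash escapes are gone before the lexer sees anything. A quick sketch of the effect (the variable name is mine, for illustration):

    # What the user typed, held here as a raw literal for illustration.
    typed = r'"I do not know what \"A\" is"'
    # input() effectively does eval(raw_input()), so the typed text is
    # evaluated as a string literal and loses the backslashes and the
    # outer quotes.
    print repr(eval(typed))   # 'I do not know what "A" is'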
    

    As a final note, you probably do not need the string anchor $ in your regular expression; it only allows a string literal that ends at the very end of the input to match, which is why only the trailing " is" fragment was reported in the second example above.
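
    If you do drop the anchor, it is safer to also let the escape pair be tried before the plain-character class, or to exclude the backslash from that class altogether; otherwise the match can stop at the first escaped quote. A minimal sketch along those lines (this uses the common backslash-escape formulation rather than the exact rule from the question, and returns the tokens so a parser would actually receive them):

    def t_NORMSTRING(t):
        r'"(\\.|[^"\\\n])*"'
        # A backslash is only consumed as part of an escape pair, so an
        # embedded \" never terminates the literal, and no $ anchor is needed.
        return t

    def t_VAR(t):
        r'[a-zA-Z_][a-zA-Z_0-9]*'
        return t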