python regex pyparsing quoting

With pyparsing, how do you parse a quoted string that ends with a backslash?


I'm trying to use pyparsing to parse quoted strings under the following conditions:

  • The quoted string might contain internal quotes.
  • I want to use backslashes to escape internal quotes.
  • The quoted string might end with a backslash.

I'm struggling to define a successful parser. Also, I'm starting to wonder whether the regular expression used by pyparsing for quoted strings of this kind is correct (see my alternative regular expression below).

Am I using pyparsing incorrectly (most likely) or is there a bug in pyparsing?

Here's a script that demonstrates the problem (Note: ignore this script; please focus instead on the Update below.):

import pyparsing as pp
import re

# A single-quoted string having:
#   - Internal escaped quote.
#   - A backslash as the last character before the final quote.
txt = r"'ab\'cd\'"

# Parse with pyparsing.
# Does not work as expected: grabs only first 3 characters.
parser = pp.QuotedString(quoteChar = "'", escChar = '\\', escQuote = '\\')
toks   = parser.parseString(txt)
print()
print('txt:    ', txt)
print('pattern:', parser.pattern)
print('toks:   ', toks)

# Parse with a regex just like the pyparsing pattern, but with
# the last two groups flipped -- which seems more correct to me.
# This works.
rgx = re.compile(r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'")
print()
print(rgx.search(txt).group(0))

Output:

txt:     'ab\'cd\'
pattern: \'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'
toks:    ["ab'"]

'ab\'cd\'
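
The difference between the two patterns can be reproduced with the standard re module alone. This is a minimal sketch: it copies the pattern printed by parser.pattern above and the flipped variant. The names pp_pattern and flipped_pattern are only illustrative.

```python
import re

# Same sample as in the script above: an escaped internal quote and
# a backslash right before the closing quote.
txt = r"'ab\'cd\'"

# Pattern as printed by parser.pattern: a bare backslash is tried
# before "backslash followed by any character".
pp_pattern = r"\'(?:[^'\n\r\\]|(?:\\)|(?:\\.))*\'"

# Same pattern with the last two alternatives flipped.
flipped_pattern = r"\'(?:[^'\n\r\\]|(?:\\.)|(?:\\))*\'"

pp_match = re.search(pp_pattern, txt).group(0)
flipped_match = re.search(flipped_pattern, txt).group(0)

print(pp_match)       # stops at the first escaped quote
print(flipped_match)  # covers the whole quoted string
```

With the original ordering, the backslash is consumed on its own, so the quote after it is taken as the closing quote. With the flipped ordering, the backslash and quote are consumed as a pair, and the regex only falls back to a bare backslash at the very end, where it has to backtrack to find a closing quote.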

Update

Thanks for the replies. I suspect that I've confused things by framing my question badly, so let me try again.

Let's say we are trying to parse a language that uses quoting rules generally like Python's. We want users to be able to define strings that can include internal quotes (protected by backslashes) and we want those strings to be able to end with a backslash. Here's an example file in our language. Note that the file would also parse as valid Python syntax, and if we printed foo (in Python), the output would be the literal value: ab'cd\

# demo.txt
foo = 'ab\'cd\\'

My goal is to use pyparsing to parse such a language. Is there a way to do it? The question above is basically where I ended up after several failed attempts. Below is my initial attempt. It fails because there are two backslashes at the end, rather than just one.

import pyparsing as pp

# Take the last whitespace-separated field of the file: the quoted string.
with open('demo.txt') as fh:
    txt = fh.read().split()[-1].strip()

parser = pp.QuotedString(quoteChar = "'", escChar = '\\')
toks   = parser.parseString(txt)
print()
print('txt:    ', txt)
print('pattern:', parser.pattern)
print('toks:   ', toks)             # ["ab'cd\\\\"]

I guess the problem is that QuotedString treats the backslash only as a quote-escape whereas Python treats a backslash as a more general-purpose escape.
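
For comparison, here is a small sketch of what Python itself makes of the literal from demo.txt, using ast.literal_eval from the standard library. It only illustrates the target behaviour and involves no pyparsing.

```python
import ast

# The right-hand side of the assignment in demo.txt, as raw text.
literal = r"'ab\'cd\\'"

# Let Python evaluate the literal with its own escaping rules.
value = ast.literal_eval(literal)

print(value)              # the string as Python sees it
print(value.count('\\'))  # number of backslashes left after unescaping
```

Python turns both \' and \\ into single characters, so the resulting value contains only one backslash. That is the result I'd like the pyparsing parser to produce as well.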

Is there a simple way to do this that I'm overlooking? One workaround that occurs to me is to use .setParseAction(...) to handle the double-backslashes after the fact -- perhaps like this, which seems to work:

qHandler = lambda s,l,t: [ t[0].replace('\\\\', '\\') ]
parser = pp.QuotedString(quoteChar = "'", escChar = '\\').setParseAction(qHandler)

Solution

  • What is it about this code that is not working for you?

    from pyparsing import *
    
    s = r"foo = 'ab\'cd\\'"  # <--- IMPORTANT - use a raw string literal here
    
    ident = Word(alphas)
    strValue = QuotedString("'", escChar='\\')
    strAssign = ident + '=' + strValue
    
    results = strAssign.parseString(s)
    print(results.asList())  # displays repr form of each element
    
    for r in results:
        print(r)  # displays str form of each element
    
    # count the backslashes
    backslash = '\\'
    print(results[-1].count(backslash))
    

    prints:

    ['foo', '=', "ab'cd\\\\"]
    foo
    =
    ab'cd\\
    2
    

    EDIT:

    So "\'" becomes just "'", but "\\" is parsed and stays as "\\" instead of being reduced to a single "\". Looks like a bug in QuotedString. For now you can add this workaround:

    import re
    strValue.setParseAction(lambda t: re.sub(r'\\(.)', r'\g<1>', t[0]))
    

    This takes every escaped character sequence and returns just the escaped character, without the leading '\'.

    I'll add this in the next patch release of pyparsing.
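
The substitution used in the workaround can be tried on its own, without pyparsing. This is a minimal sketch. The helper name unescape is hypothetical, and the first sample token is copied from the output printed above.

```python
import re

def unescape(token):
    """Hypothetical helper: replace each backslash + character pair
    with the character alone (same regex as the workaround)."""
    return re.sub(r'\\(.)', r'\g<1>', token)

# Token as shown in the output above: ab'cd followed by two backslashes.
token = "ab'cd\\\\"
fixed = unescape(token)
print(fixed)

# The same substitution applied to the raw contents between the quotes.
raw = r"ab\'cd\\"
print(unescape(raw))
```

Note that a lone trailing backslash with no character after it does not match the regex, so it is left as it is.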