
python pyparsing non match (keyword) criteria for Word


I'm trying to create a parser which parses different kinds of expressions consisting of verilog strings and quoted strings. To get this to work, I am using the MatchFirst construct. One hiccup I am encountering is I don't know how to create a Word which doesn't match if followed by certain characters.

The short version of the problem

Let's assume I want a Word that can accept the characters 'A' and 'B' but not if they are followed by any other letter. So these should match:

A
AB
BA
BAABBABABABA

But this shouldn't match: BABC

Currently, the parser ends up partially matching, which messes up the result.
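
A minimal sketch of the short version, assuming a pyparsing release in which Word accepts the asKeyword argument that comes up later in this post. The helper try_parse is a hypothetical name introduced only for this illustration. It contrasts a plain Word with one that requires a word boundary on both sides of the match:

```python
import pyparsing as pp

# Plain Word: stops at the first character outside "AB", so a partial match succeeds.
plain = pp.Word("AB")
# Same Word with asKeyword=True: the match must start and end on a word boundary.
bounded = pp.Word("AB", asKeyword=True)


def try_parse(expr, text):
    """Return the parsed tokens as a list, or None if the expression does not match."""
    try:
        return expr.parseString(text).asList()
    except pp.ParseException:
        return None


for text in ["A", "AB", "BA", "BAABBABABABA", "BABC"]:
    print(text, try_parse(plain, text), try_parse(bounded, text))
```

With the plain Word, "BABC" yields the partial match "BAB"; with asKeyword=True it is rejected, because "BAB" is followed by another word character.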

The long version of the problem

This question is related to a previous question I have asked: python pyparsing "^" vs "|" keywords

Below is a Python 3 testcase illustrating the problem. NOTE: If I change the parser from the MatchFirst construct to Or, the testcase passes, i.e. when I use parser = (_get_verilog_num_parse() ^ pp.Literal("Some_demo_literal")) ^ pp.quotedString instead of parser = (_get_verilog_num_parse() ^ pp.Literal("Some_demo_literal")) | pp.quotedString. But again, this forms part of a more complex parser and (I think) I need the priority to get it to work.

So ultimately, the question is how can I get this match to work without relying on the OR's "longest" match selectivity?
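
For reference, a small sketch of the difference between the two operators. It uses two made-up literals instead of the real parser, so it only illustrates the selection rule and is not the parser from the testcase:

```python
import pyparsing as pp

# Two illustrative literals where one is a prefix of the other.
short = pp.Literal("ab")
longer = pp.Literal("abc")

# "|" builds a MatchFirst: the first alternative that matches wins.
first_match = short | longer
# "^" builds an Or: every alternative is tried and the longest match wins.
longest_match = short ^ longer

first_result = first_match.parseString("abc").asList()
longest_result = longest_match.parseString("abc").asList()

print(first_result)    # ['ab']
print(longest_result)  # ['abc']
```

So with "|", an early alternative that matches only part of the input hides the later ones, which is the situation described above.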

TestCase

import unittest
import pyparsing as pp

def _get_verilog_num_parse():
    """Get a parser that can read a verilog number
    return: Parser for verilog numbers
    rtype: PyParsing parser object

    See this link where I got help with getting this parser to work:
    https://stackoverflow.com/questions/34258011/python-pyparsing-vs-keywords
    """
    apos           = pp.Suppress(pp.Literal("'"))
    size_num        = pp.Word(pp.nums+'_'                  ).setParseAction(lambda x:int(x[0].replace('_', ''),10))
    #dec_num        = pp.Word(pp.nums+'_'   , asKeyword=True).setParseAction(lambda x:int(x[0].replace('_', ''),10))
    dec_num        = pp.Word(pp.nums+'_'                   ).setParseAction(lambda x:int(x[0].replace('_', ''),10))
    hex_num        = pp.Word(pp.hexnums+'_', asKeyword=True).setParseAction(lambda x:int(x[0].replace('_', ''),16))
    bin_num        = pp.Word('01'+'_',       asKeyword=True).setParseAction(lambda x:int(x[0].replace('_', ''),2))

    size           = pp.Optional(size_num).setResultsName('size')


    def size_mask(parser):
        size = parser.get('size')
        if size is not None:
            return parser['value'] & ((1<<size) -1)
        else:
            return parser['value']

    radix_int = pp.ungroup(pp.CaselessLiteral('d').suppress() + dec_num |
                           pp.CaselessLiteral('h').suppress() + hex_num |
                           pp.CaselessLiteral('b').suppress() + bin_num)
    #print(radix_int)
    return (size + apos + radix_int('value')).addParseAction(size_mask)

class test_PyParsing(unittest.TestCase):
    '''Check that the Expression Parser works with the expressions
    defined in this test'''

    def test_or(self):
        """Check basic expressions not involving referenced parameters"""
        expressions_to_test = [
                ("8'd255",255),
                ("'d255",255),
                ("12'h200",0x200),
                ("'blah'","'blah'"),
                ("'HARDWARE'","'HARDWARE'"),
                ("'HA'","'HA'"),
                ("'b101010'","'b101010'"),
                ("'d1010'","'d1010'"),
                ("'1010'","'1010'"),
                ]
        parser = (_get_verilog_num_parse() ^ pp.Literal("Some_demo_literal")) | pp.quotedString
        for expr,expected in expressions_to_test:
            result = parser.parseString(expr)
            #print("result: {}, val: {}".format(result, result[0]))
            self.assertEqual(expected,result[0], "test_string: {}, expected: {}, result: {}".format(expr, expected, result[0]))

if __name__ == "__main__":
    # Allow running this file directly as a script
    unittest.main()

Results

self.assertEqual(expected,result[0], "test_string: {}, expected: {}, result: {}".format(expr, expected, result[0]))
AssertionError: "'HARDWARE'" != 10 : test_string: 'HARDWARE', expected: 'HARDWARE', result: 10

So here, the test string is being interpreted as the Verilog number 'HA, which is 10, instead of the quoted string 'HARDWARE'.
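
The partial match can be reproduced in isolation. This is a reduced sketch that assumes only the apostrophe, the hex prefix and a hex Word without asKeyword; the size field and the parse actions from the testcase are left out:

```python
import pyparsing as pp

apos = pp.Suppress(pp.Literal("'"))
hex_prefix = pp.CaselessLiteral("h").suppress()
# No asKeyword here, so the Word is allowed to stop in the middle of "ARDWARE".
hex_digits = pp.Word(pp.hexnums + "_")

# The leading quote and the "H" are consumed, then only "A" is a hex digit.
tokens = (apos + hex_prefix + hex_digits).parseString("'HARDWARE'").asList()
value = int(tokens[0].replace("_", ""), 16)

print(tokens, value)  # ['A'] 10
```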

I've tried messing around with the asKeyword keyword argument but I have not had any luck with this.

EDIT

Based on Paul's help thus far, I have added additional checks to the testcase to further refine the solution. I used Paul's suggestion of adding asKeyword=True to the definition of hex_num, which solved my original problem. I then added it to the definition of bin_num as well, which satisfies the added checks:

("'b101010'","'b101010'"),
("'d1010'","'d1010'"),

I then added 2 more checks:

("'d1010'","'d1010'"),
("'1010'","'1010'"),

which then fail the testcase with the following result:

self.assertEqual(expected,result[0], "test_string: {}, expected: {}, result: {}".format(expr, expected, result[0]))
AssertionError: "'d1010'" != 1010 : test_string: 'd1010', expected: 'd1010', result: 1010

The logical thing to try next is to add asKeyword=True to the definition of dec_num. I did that, but it results in this strange error:

  result = parser.parseString(expr)
File "c:\users\gkuhn\appdata\local\continuum\anaconda3\lib\site-packages\pyparsing.py", line 1125, in parseString
  raise exc
File "c:\users\gkuhn\appdata\local\continuum\anaconda3\lib\site-packages\pyparsing.py", line 1115, in parseString
  loc, tokens = self._parse( instring, 0 )
File "c:\users\gkuhn\appdata\local\continuum\anaconda3\lib\site-packages\pyparsing.py", line 989, in _parseNoCache
  loc,tokens = self.parseImpl( instring, preloc, doActions )
File "c:\users\gkuhn\appdata\local\continuum\anaconda3\lib\site-packages\pyparsing.py", line 2497, in parseImpl
  raise maxException
File "c:\users\gkuhn\appdata\local\continuum\anaconda3\lib\site-packages\pyparsing.py", line 2483, in parseImpl
  ret = e._parse( instring, loc, doActions )
File "c:\users\gkuhn\appdata\local\continuum\anaconda3\lib\site-packages\pyparsing.py", line 989, in _parseNoCache
  loc,tokens = self.parseImpl( instring, preloc, doActions )
File "c:\users\gkuhn\appdata\local\continuum\anaconda3\lib\site-packages\pyparsing.py", line 2440, in parseImpl
  raise maxException
pyparsing.ParseException: Expected W:(0123...) (at char 3), (line:1, col:4)

Note

Adding the asKeyword=True seems to also mess up the parsing of the numbers as opposed to the quoted strings.


Solution

  • The asKeyword argument to Word brackets the internal regular expression with '\b'. I think your addition of the excludeChars argument is messing things up. Just define hex_num as:

    hex_num = pp.Word(pp.hexnums+'_', asKeyword=True).setParseAction(
                                                      lambda x:int(x[0].replace('_', ''),16))
    

    When I run your test code, this works. (I think hex_num is the only one of the 3 numerics that requires this, since decimal and binary don't have any ambiguity with trailing alphabetic characters.)

    FYI - excludeChars is added to Word to simplify defining character groups of "everything in printables except ':'", or "everything in alphanums except 'Q'". (https://pythonhosted.org/pyparsing/pyparsing.Word-class.html)

    EDIT

    I think part of the issue is that we need to look at both the prefix h/d/b character and the numeric characters in a single expression in order to do the right thing with the numeric characters. We want to enforce a break after the numerics, but not between the leading prefix and the numerics. I'm afraid the best way to do this is to resort to a Regex. Here is a set of expressions that combines the prefix and numerics into an equivalent regex, and adds the trailing-but-not-leading word break:

    make_num_expr = lambda prefix,numeric_chars,radix: pp.Regex(r"[%s%s](?P<num>[%s_]+)\b" % 
                                                                    (prefix,prefix.upper(),numeric_chars)).setParseAction(
                                                                            lambda x: int(x.num.replace('_',''), radix))
    dec_num = make_num_expr('d', pp.nums, 10).setName("dec_num")
    hex_num = make_num_expr('h', pp.hexnums, 16).setName("hex_num")
    bin_num = make_num_expr('b', '01', 2).setName("bin_num")
    
    radix_int = (dec_num | hex_num | bin_num).setName("radix_int")
    

    Note the use of the named group num for the numeric field of the Regex. I also added setName calls, which are a bit more important now that Or and MatchFirst (correctly) enumerate all options in their exception messages.

    EDIT(2)

    Just noticed that we fail on 'HA'. I think this gets resolved if you just change the order of your parser alternatives:

    parser = pp.quotedString | (_get_verilog_num_parse() ^ pp.Literal("Some_demo_literal"))
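
The trailing-but-not-leading word break from the first EDIT, and the reason 'HA' still needs the reordering from EDIT(2), can be checked with the standard re module alone. This is only a sketch: make_num_regex and parse_num are hypothetical helpers, the two character strings stand in for pp.hexnums and pp.nums, and each input is the text that remains after the leading apostrophe has been consumed.

```python
import re

# Stand-ins for pp.hexnums and pp.nums (assumed to hold these characters).
HEXNUMS = "0123456789ABCDEFabcdef"
NUMS = "0123456789"


def make_num_regex(prefix, numeric_chars):
    """Prefix letter (either case), the digits, then a trailing word break only."""
    return re.compile(r"[%s%s](?P<num>[%s_]+)\b" % (prefix, prefix.upper(), numeric_chars))


def parse_num(regex, text, radix):
    """Return the integer value if the regex matches at the start of text, else None."""
    m = regex.match(text)
    if m is None:
        return None
    return int(m.group("num").replace("_", ""), radix)


hex_re = make_num_regex("h", HEXNUMS)
dec_re = make_num_regex("d", NUMS)

print(parse_num(hex_re, "h200", 16))       # 512
print(parse_num(dec_re, "d1_000", 10))     # 1000
print(parse_num(hex_re, "HARDWARE'", 16))  # None: "A" is followed by "R", no word break
print(parse_num(hex_re, "HA'", 16))        # 10: "A" is followed by a quote, so it matches

# A break on both sides fails right after the prefix, because "d" and "2"
# are both word characters and there is no boundary between them.
both_sides = re.compile(r"\b[0-9_]+\b")
print(both_sides.match("d255", 1))         # None
```

Since HA followed by a quote is a perfectly valid hex number, no boundary check can tell it apart from the quoted string 'HA', which is presumably why trying pp.quotedString first is the suggested fix.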