building a pyparsing.Dict from a string of multiple tokens - part II


I've made some progress thanks to feedback from this forum (thanks, forum!). The pyparsing.Dict object, dict, is now being populated, but parsing silently stops when it reaches a decimal number.

given:

import pyparsing as pp

lines = '''\
(rate multiple)
(region "mountainous")
(elev       21439)
(alteleva  +21439)
(altelevb  -21439)
(coorda  23899.747)
(coordb +23899.747)
(coordc -23899.747)
(coordd  853.324e21)
(coorde +853.324e21)
(coordf -853.324e21)
(coordg  987.88e+09)
(coordh +987.88e+09)
(coordi -987.88e+09)
(coordj  122.45e-04)
(coordk +122.45e-04)
(coordl -122.45e-04)
'''

leftParen    = pp.Literal('(')
rightParen   = pp.Literal(')')
colon        = pp.Literal(':')
decimalpoint = pp.Literal('.')
doublequote  = pp.Literal('"')
plusorminus  = pp.Literal('+') | pp.Literal('-') 
exp          = pp.CaselessLiteral('E')

v_string = pp.Word(pp.alphanums)
v_quoted_string = pp.Combine( doublequote + v_string + doublequote)
v_number = pp.Regex(r'[+-]?(?P<float1>\d+)(?P<float2>\.\d+)?(?P<float3>[Ee][+-]?\d+)?')

keyy = v_string
valu = v_string | v_quoted_string | v_number

item  = pp.Group( pp.Literal('(').suppress() + keyy + valu + pp.Literal(')').suppress() )
items = pp.ZeroOrMore( item)
dict = pp.Dict( items)

print("dict yields: ", dict.parseString(lines).dump())

yields

- alteleva: '+21439',
- altelevb: '-21439',
- elev: '21439',
- rate: 'multiple',
- region: '"mountainous"'

Changing the order of the tokens shows that the script silently fails at the first decimal number, which suggests there's something subtly wrong with the pp.Regex statement, but I sure can't spot it.

TIA,

code_warrior


Solution

  • Your problem actually lies in this expression:

    valu = v_string | v_quoted_string | v_number
    

    Because v_string is defined as the very broadly-matching expression:

    v_string = pp.Word(pp.alphanums)
    

    and because it is the first expression in valu, it will mask any v_number that starts with a digit. This is because the '|' operator produces pp.MatchFirst objects, so the first alternative that matches (reading left to right) is the one that is used. Word(pp.alphanums) matches only the leading digits of a value like 23899.747 and stops at the decimal point, so the closing ')' is not found, that item fails, and ZeroOrMore ends quietly with whatever it has collected so far. You could switch to the '^' operator, which produces pp.Or objects; the Or class evaluates all of the alternatives and goes with the longest match. However, note that using Or carries a performance penalty, since many more expressions are tested for a match even when there is no chance of confusion. In your case, you can just reorder the expressions to put the least specific matching expression last:

    valu = v_quoted_string | v_number | v_string
    

    Now values will be parsed first attempting to parse as quoted strings, then as numbers, and then only if there is no match for either of these specific types, as the very general type v_string.
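
    To make the masking concrete, here is a minimal sketch that tries the three ways of combining the alternatives on a single decimal value, and then shows how passing parseAll=True turns the silent stop into an exception. The names masked, longest, reordered and the shortened regex are mine, for illustration only; the camelCase method names assume a pyparsing version that still accepts them.

```python
import pyparsing as pp

v_string = pp.Word(pp.alphanums)
# simplified stand-in for the v_number regex in the question
v_number = pp.Regex(r'[+-]?\d+(\.\d+)?([Ee][+-]?\d+)?')

masked = v_string | v_number      # MatchFirst: v_string wins on a leading digit
longest = v_string ^ v_number     # Or: every alternative is tried, longest wins
reordered = v_number | v_string   # MatchFirst again, most specific first

sample = "23899.747"
masked_result = masked.parseString(sample).asList()        # ['23899']
longest_result = longest.parseString(sample).asList()      # ['23899.747']
reordered_result = reordered.parseString(sample).asList()  # ['23899.747']

# The same masking inside the item grammar: the second item cannot find
# its ')' after '23899', so ZeroOrMore stops after the first item.
item = pp.Group(pp.Suppress('(') + v_string + masked + pp.Suppress(')'))
items = pp.ZeroOrMore(item)
text = "(elev 21439)\n(coorda 23899.747)"
partial = items.parseString(text).asList()                 # [['elev', '21439']]

# parseAll=True requires the whole input to be consumed, so the
# unparsed remainder raises instead of being dropped silently.
try:
    items.parseString(text, parseAll=True)
    strict_failed = False
except pp.ParseException:
    strict_failed = True
```

    While developing a grammar, parseAll=True is a cheap way to find out exactly where parsing gave up.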

    A few other comments:

    I personally prefer to parse quoted strings and get only the content within the quotes (it's a string, I know it already!). Older versions of pyparsing caused some confusion here, because dump() displayed parsed strings without any enclosing quotes. But now that I use repr() to show the parsed values, strings show up in quotes when calling dump(), even though the value itself does not include the quotes. When the value is used elsewhere in the program, such as saving to a database or CSV, I don't need the quotes, I just want the string content. The QuotedString class takes care of this for me by default. Alternatively, use pp.quotedString().addParseAction(pp.removeQuotes).
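
    As a small sketch of the difference, the snippet below parses the same quoted value three ways: with the Combine expression from the question, with QuotedString, and with quotedString plus the removeQuotes parse action. The variable names are illustrative only.

```python
import pyparsing as pp

# the question's approach: the quotes stay in the parsed value
with_quotes = pp.Combine(pp.Literal('"') + pp.Word(pp.alphanums) + pp.Literal('"'))

# QuotedString strips the enclosing quotes by default
without_quotes = pp.QuotedString('"')

# alternative: the predefined quotedString plus the removeQuotes parse action
# (calling quotedString() makes a copy, so the shared expression is not modified)
also_without = pp.quotedString().addParseAction(pp.removeQuotes)

text = '"mountainous"'
kept = with_quotes.parseString(text)[0]         # '"mountainous"'
stripped = without_quotes.parseString(text)[0]  # 'mountainous'
stripped_too = also_without.parseString(text)[0]  # 'mountainous'

print(repr(kept), repr(stripped), repr(stripped_too))
```

    QuotedString also accepts content that Word(pp.alphanums) would reject, such as spaces inside the quotes.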

    A recent pyparsing release introduced the pyparsing_common namespace class, containing a number of helpful pre-defined expressions. There are several for parsing different numeric types (integer, signed integer, real, etc.), and a couple of blanket expressions: number will parse any numeric type, and produce values of the respective type (real will give a float, integer will give an int, etc.); fnumber will parse various numerics, but return them all as floats. I've replaced your v_number expression with just pp.pyparsing_common.number(), which also permits me to remove several other partial expressions that were defined just for building up the v_number expression, like decimalpoint, plusorminus and exp. You can see more about the expressions in pyparsing_common at the online docs: https://pythonhosted.org/pyparsing/
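
    Here is a short sketch of what number and fnumber hand back for a few of the values from your test input. It assumes a pyparsing version that includes pyparsing_common; the alias ppc and the variable names are just for brevity.

```python
import pyparsing as pp

ppc = pp.pyparsing_common

# number converts to the matching Python type
as_int = ppc.number.parseString("+21439")[0]           # int
as_float = ppc.number.parseString("23899.747")[0]      # float
as_sci = ppc.number.parseString("-853.324e21")[0]      # float

# fnumber returns a float for every numeric form, integers included
always_float = ppc.fnumber.parseString("21439")[0]

for value in (as_int, as_float, as_sci, always_float):
    print(type(value).__name__, value)
```

    Because the conversion happens at parse time, the values in the resulting Dict are ready to use in arithmetic without a separate int() or float() step.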

    Pyparsing's default behavior when processing literal strings in an expression like "(" + pp.Word(pp.alphas) + valu + ")" is to automatically convert the literal "(" and ")" terms to pp.Literal objects. This prevents accidentally losing parsed data, but in the case of punctuation, you end up with many cluttering and unhelpful extra strings in the parsed results. In your parser, you can replace pyparsing's default by calling pp.ParserElement.inlineLiteralsUsing and passing the pp.Suppress class:

    pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
    

    Now you can write an expression like:

    item  = pp.Group('(' + keyy + valu + ')')
    

    and the grouping parentheses will be suppressed from the parsed results.
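
    The effect is easiest to see side by side. In this sketch the same expression is built once before and once after the call; the literal class is chosen when the expression is constructed, so the first one keeps its pp.Literal parentheses. Note that inlineLiteralsUsing changes a class-level setting for the whole process, so the sketch restores the default at the end. The names word, default_item and suppressed_item are illustrative.

```python
import pyparsing as pp

word = pp.Word(pp.alphanums)

# built with the default: '(' and ')' become pp.Literal and show up in the results
default_item = pp.Group('(' + word + word + ')')
default_tokens = default_item.parseString("(rate multiple)").asList()
# [['(', 'rate', 'multiple', ')']]

# from here on, inline string literals become pp.Suppress
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
suppressed_item = pp.Group('(' + word + word + ')')
suppressed_tokens = suppressed_item.parseString("(rate multiple)").asList()
# [['rate', 'multiple']]

# restore the default so other parsers in the same process are unaffected
pp.ParserElement.inlineLiteralsUsing(pp.Literal)

print(default_tokens)
print(suppressed_tokens)
```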

    Making these changes, your parser now simplifies to:

    import pyparsing as pp
    
    # override pyparsing default to suppress literal strings in expressions
    pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
    
    v_string = pp.Word(pp.alphanums)
    v_quoted_string = pp.QuotedString('"')
    v_number = pp.pyparsing_common.number()
    
    keyy = v_string
    # define valu using least specific expressions last
    valu = v_quoted_string | v_number | v_string
    
    item  = pp.Group('(' + keyy + valu + ')')
    items = pp.ZeroOrMore( item)
    dict_expr = pp.Dict( items)
    
    print ("dict yields: ",  dict_expr.parseString( lines).dump())
    

    And for your test input, gives:

    dict yields:  [['rate', 'multiple'], ['region', 'mountainous'], ['elev', 21439], 
    ['alteleva', 21439], ['altelevb', -21439], ['coorda', 23899.747], ['coordb', 
    23899.747], ['coordc', -23899.747], ['coordd', 8.53324e+23], ['coorde', 
    8.53324e+23], ['coordf', -8.53324e+23], ['coordg', 987880000000.0], ['coordh', 
    987880000000.0], ['coordi', -987880000000.0], ['coordj', 0.012245], ['coordk', 
    0.012245], ['coordl', -0.012245]]
    - alteleva: 21439
    - altelevb: -21439
    - coorda: 23899.747
    - coordb: 23899.747
    - coordc: -23899.747
    - coordd: 8.53324e+23
    - coorde: 8.53324e+23
    - coordf: -8.53324e+23
    - coordg: 987880000000.0
    - coordh: 987880000000.0
    - coordi: -987880000000.0
    - coordj: 0.012245
    - coordk: 0.012245
    - coordl: -0.012245
    - elev: 21439
    - rate: 'multiple'
    - region: 'mountainous'
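
    Once the Dict is populated, the values can be read by key rather than by position. The following is a self-contained sketch using a shortened copy of the test input; it uses explicit pp.Suppress for the parentheses so that it does not depend on the global inlineLiteralsUsing setting, and the names key, value and result are illustrative.

```python
import pyparsing as pp

lines = '''\
(rate multiple)
(region "mountainous")
(elev       21439)
(coordf -853.324e21)
'''

key = pp.Word(pp.alphanums)
# least specific alternative last, as described above
value = pp.QuotedString('"') | pp.pyparsing_common.number() | pp.Word(pp.alphanums)

item = pp.Group(pp.Suppress('(') + key + value + pp.Suppress(')'))
dict_expr = pp.Dict(pp.ZeroOrMore(item))

# parseAll=True so that an unparseable item raises instead of being skipped
result = dict_expr.parseString(lines, parseAll=True)

elev = result['elev']        # dict-style access
region = result.region       # attribute-style access
as_dict = result.asDict()    # plain Python dict

print(elev, region)
print(as_dict)
```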