I've made some progress thanks to feedback from this forum (thanks, forum!). The pyparsing.Dict object dict is now getting populated, but parsing silently stops when it reaches a decimal number.
Given:
import pyparsing as pp
lines = '''\
(rate multiple)
(region "mountainous")
(elev 21439)
(alteleva +21439)
(altelevb -21439)
(coorda 23899.747)
(coordb +23899.747)
(coordc -23899.747)
(coordd 853.324e21)
(coorde +853.324e21)
(coordf -853.324e21)
(coordg 987.88e+09)
(coordh +987.88e+09)
(coordi -987.88e+09)
(coordj 122.45e-04)
(coordk +122.45e-04)
(coordl -122.45e-04)
'''
leftParen = pp.Literal('(')
rightParen = pp.Literal(')')
colon = pp.Literal(':')
decimalpoint = pp.Literal('.')
doublequote = pp.Literal('"')
plusorminus = pp.Literal('+') | pp.Literal('-')
exp = pp.CaselessLiteral('E')
v_string = pp.Word(pp.alphanums)
v_quoted_string = pp.Combine( doublequote + v_string + doublequote)
v_number = pp.Regex(r'[+-]?(?P<float1>\d+)(?P<float2>\.\d+)?(?P<float3>[Ee][+-]?\d+)?')
keyy = v_string
valu = v_string | v_quoted_string | v_number
item = pp.Group( pp.Literal('(').suppress() + keyy + valu + pp.Literal(')').suppress() )
items = pp.ZeroOrMore( item)
dict = pp.Dict( items)
print "dict yields: ", dict.parseString( lines).dump()
yields
- alteleva: '+21439',
- altelevb: '-21439',
- elev: '21439',
- rate: 'multiple',
- region: '"mountainous"'
Changing the order of the input lines shows that the script silently stops when it hits the first decimal number, which implies there's something subtly wrong with the pp.Regex statement, but I sure can't spot it.
TIA,
code_warrior
Your problem actually lies in this expression:
valu = v_string | v_quoted_string | v_number
Because v_string is defined as the very broadly-matching expression:
v_string = pp.Word(pp.alphanums)
and because it is the first expression in valu, it will mask v_numbers that start with a digit. This is because the '|' operator produces pp.MatchFirst objects, so the first expression matched (reading left-to-right) will determine which alternative is used. For a value like 23899.747, v_string matches just the leading 23899; the parser then expects the closing ')' but finds '.', so that item fails to match and ZeroOrMore simply stops there without raising an error. You can convert to using the '^' operator, which produces pp.Or objects; the Or class will try to evaluate all the alternatives and then go with the longest match. However, note that using Or carries a performance penalty, since many more expressions are tested for a match even when there is no chance for confusion. In your case, you can just reorder the expressions to put the least specific matching expression last:
valu = v_quoted_string | v_number | v_string
Now each value is tried first as a quoted string, then as a number, and only if neither of these more specific types matches, as the very general type v_string.
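To make the difference concrete, here is a minimal sketch that parses a single decimal value three ways. It reuses the v_string and v_number definitions from the question; the variable names first, longest, reordered and silent_failure_exposed are just for illustration. The parseAll=True argument is not part of the original script, but it is a handy way to turn a silent stop into a visible exception.

```python
import pyparsing as pp

v_string = pp.Word(pp.alphanums)
v_number = pp.Regex(r'[+-]?(?P<float1>\d+)(?P<float2>\.\d+)?(?P<float3>[Ee][+-]?\d+)?')

text = "23899.747"

# '|' builds a MatchFirst: v_string wins and stops at the decimal point
first = (v_string | v_number).parseString(text)[0]

# '^' builds an Or: every alternative is tried and the longest match wins
longest = (v_string ^ v_number).parseString(text)[0]

# reordering keeps MatchFirst but tries the more specific expression first
reordered = (v_number | v_string).parseString(text)[0]

print(first, longest, reordered)

# parseAll=True requires the whole input to be consumed, so the
# leftover ".747" raises an exception instead of being dropped silently
silent_failure_exposed = False
try:
    (v_string | v_number).parseString(text, parseAll=True)
except pp.ParseException as err:
    silent_failure_exposed = True
    print("parseAll exposes the leftover input:", err)
```

Adding parseAll=True to the parseString call in the original script should likewise have reported the position of the first decimal number rather than returning a partial result.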
A few other comments:
I personally prefer to parse quoted strings and only get the content within the quotes (it's a string, I know it already!). There used to be some confusion with older versions of pyparsing when dumping out the parsed results, when parsed strings were displayed without any enclosing quotes. But now that I use repr() to show the parsed values, strings show up in quotes when calling dump(), but the value itself does not include the quotes. When it is used elsewhere in the program, such as saving to a database or CSV, I don't need the quotes, I just want the string content. The QuotedString class takes care of this for me by default. Or use pp.quotedString().addParseAction(pp.removeQuotes).
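As a quick illustration of the three options, the sketch below parses the same quoted value with the Combine-based expression from the question, with QuotedString, and with quotedString plus removeQuotes. The names keeps_quotes, strips_quotes and also_strips are made up for this example.

```python
import pyparsing as pp

# the question's approach: the quotes become part of the value
keeps_quotes = pp.Combine(pp.Literal('"') + pp.Word(pp.alphanums) + pp.Literal('"'))

# QuotedString returns only the content between the quotes by default
strips_quotes = pp.QuotedString('"')

# equivalent result using the predefined quotedString expression
also_strips = pp.quotedString().addParseAction(pp.removeQuotes)

sample = '"mountainous"'
with_quotes = keeps_quotes.parseString(sample)[0]
content_only = strips_quotes.parseString(sample)[0]
content_too = also_strips.parseString(sample)[0]

print(repr(with_quotes), repr(content_only), repr(content_too))
```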
A recent pyparsing release introduced the pyparsing_common namespace class, containing a number of helpful pre-defined expressions. There are several for parsing different numeric types (integer, signed integer, real, etc.), and a couple of blanket expressions: number will parse any numeric type, and produce values of the respective type (real will give a float, integer will give an int, etc.); fnumber will parse various numerics, but return them all as floats. I've replaced your v_number expression with just pp.pyparsing_common.number(), which also permits me to remove several other partial expressions that were defined just for building up the v_number expression, like decimalpoint, plusorminus and exp. You can see more about the expressions in pyparsing_common at the online docs: https://pythonhosted.org/pyparsing/
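Here is a small sketch of how number and fnumber differ on a few of the values from the test input. The samples list and the results layout are only for illustration.

```python
import pyparsing as pp

ppc = pp.pyparsing_common
samples = ["21439", "-23899.747", "853.324e21"]

# each entry is (text, value from number, value from fnumber)
results = [
    (text, ppc.number.parseString(text)[0], ppc.fnumber.parseString(text)[0])
    for text in samples
]

for text, as_number, as_fnumber in results:
    # number keeps ints as ints; fnumber converts everything to float
    print(text, type(as_number).__name__, type(as_fnumber).__name__)
```

If the downstream code always wants floats, fnumber saves a conversion step; otherwise number keeps the integer values as ints.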
Pyparsing's default behavior when processing literal strings in an expression like "(" + pp.Word(pp.alphas) + valu + ")" is to automatically convert the literal "(" and ")" terms to pp.Literal objects. This prevents accidentally losing parsed data, but in the case of punctuation, you end up with many cluttering and unhelpful extra strings in the parsed results. In your parser, you can replace pyparsing's default by calling pp.ParserElement.inlineLiteralsUsing and passing the pp.Suppress class:
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
Now you can write an expression like:
item = pp.Group('(' + keyy + valu + ')')
and the grouping parentheses will be suppressed from the parsed results.
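The effect is easy to see side by side. In this sketch the same expression is built once before and once after the call; word, default_item and suppressed_item are illustrative names. The setting is class-wide and only affects expressions constructed after the call, so the sketch restores pp.Literal at the end as a courtesy to any other parsers in the same process.

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

# default: inline string literals become pp.Literal and stay in the results
default_item = pp.Group('(' + word + word + ')')
kept = default_item.parseString('(rate multiple)').asList()

# from here on, inline string literals are wrapped in pp.Suppress instead
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
suppressed_item = pp.Group('(' + word + word + ')')
dropped = suppressed_item.parseString('(rate multiple)').asList()

# restore the default so later expressions behave as usual
pp.ParserElement.inlineLiteralsUsing(pp.Literal)

print(kept)
print(dropped)
```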
Making these changes, your parser now simplifies to:
import pyparsing as pp
# override pyparsing default to suppress literal strings in expressions
pp.ParserElement.inlineLiteralsUsing(pp.Suppress)
v_string = pp.Word(pp.alphanums)
v_quoted_string = pp.QuotedString('"')
v_number = pp.pyparsing_common.number()
keyy = v_string
# define valu using least specific expressions last
valu = v_quoted_string | v_number | v_string
item = pp.Group('(' + keyy + valu + ')')
items = pp.ZeroOrMore( item)
dict_expr = pp.Dict( items)
# lines is the test input string defined in the question
print("dict yields: ", dict_expr.parseString(lines).dump())
And for your test input, gives:
dict yields: [['rate', 'multiple'], ['region', 'mountainous'], ['elev', 21439],
['alteleva', 21439], ['altelevb', -21439], ['coorda', 23899.747], ['coordb',
23899.747], ['coordc', -23899.747], ['coordd', 8.53324e+23], ['coorde',
8.53324e+23], ['coordf', -8.53324e+23], ['coordg', 987880000000.0], ['coordh',
987880000000.0], ['coordi', -987880000000.0], ['coordj', 0.012245], ['coordk',
0.012245], ['coordl', -0.012245]]
- alteleva: 21439
- altelevb: -21439
- coorda: 23899.747
- coordb: 23899.747
- coordc: -23899.747
- coordd: 8.53324e+23
- coorde: 8.53324e+23
- coordf: -8.53324e+23
- coordg: 987880000000.0
- coordh: 987880000000.0
- coordi: -987880000000.0
- coordj: 0.012245
- coordk: 0.012245
- coordl: -0.012245
- elev: 21439
- rate: 'multiple'
- region: 'mountainous'
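Once the values come back as real ints, floats and unquoted strings, the Dict results can be used directly. The following is a minimal sketch on a shortened input; it uses explicit pp.Suppress objects (LPAR and RPAR, names chosen here) rather than changing the global inline-literal setting, and the sample string is a cut-down version of the test input.

```python
import pyparsing as pp

LPAR, RPAR = map(pp.Suppress, '()')
value = pp.QuotedString('"') | pp.pyparsing_common.number | pp.Word(pp.alphanums)
item = pp.Group(LPAR + pp.Word(pp.alphanums) + value + RPAR)
dict_expr = pp.Dict(pp.ZeroOrMore(item))

sample = '(region "mountainous") (elev 21439) (coordj 122.45e-04)'
result = dict_expr.parseString(sample, parseAll=True)

region = result['region']        # key lookup, like a dict
elev = result.elev               # attribute-style access also works
as_plain_dict = result.asDict()  # convert to an ordinary Python dict

print(region, elev, as_plain_dict)
```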