I tried using pyparsing
to parse a CSV with:
I worked a solution but it seems "dirty". Mainly, the Optional
inside only one of the possible atomics. I think the optional should be independent of the atomics. That is, I feel it should be put somewhere else, for example in the delimitedList
optional arguments, but in my trial and error that was the only place that worked and made sense. It could be in any of the possible atomics so I chose the first.
Also, I don't fully understand what originalTextFor
is doing but if I remove it it stops working.
Working example:
import pyparsing as pp
# Function that parses a line of columns separated by commas and returns a list of the columns
def fromLineToRow(line):
sqbrackets_col = pp.Word(pp.printables, excludeChars="[],") | pp.nestedExpr(opener="[",closer="]") # matches "a[1,2]"
parens_col = pp.Word(pp.printables, excludeChars="(),") | pp.nestedExpr(opener="(",closer=")") # matches "a(1,2)"
# In the following line:
# * The "^" means "choose the longest option"
# * The "pp.Optional" can be in any of the expressions separated by "^". I put it only on the first. It's used for when there are missing values
atomic = pp.originalTextFor(pp.Optional(pp.OneOrMore(parens_col))) ^ pp.originalTextFor(pp.OneOrMore(sqbrackets_col))
grammar = pp.delimitedList(atomic)
row = grammar.parseString(line).asList()
return row
file_str = \
"""YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""
for line in file_str.splitlines():
row = fromLineToRow(line)
print(row)
Prints:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']
Is this the right way to do this? Is there a "cleaner" way to use the Optional
inside the first atomic?
Working inside-out, I get this:
# chars not in ()'s or []'s - also disallow ','
non_grouped = pp.Word(pp.printables, excludeChars="[](),")
# grouped expressions in ()'s or []'s
grouped = pp.nestedExpr(opener="[",closer="]") | pp.nestedExpr(opener="(",closer=")")
# use OneOrMore to allow non_grouped and grouped together
atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
# or based on your examples, you *could* tighten this up to:
# atomic = pp.originalTextFor(non_grouped + pp.Optional(grouped))
originalTextFor
recombines the original input text within the leading and trailing boundaries of the matched expressions, and returns a single string. If you leave this out, then you will get all the sub-expressions in a nested list of strings, like ['a',['2,3']]
. You could rejoin them with repeated calls to ''.join
, but that would collapse out whitespace (or use ' '.join
, but that has the opposite problem of potentially introducing whitespace).
To optionalize the elements of the list, just say so in the definition of the delimited list:
grammar = pp.delimitedList(pp.Optional(atomic, default=''))
Be sure to add the default value, else the empty slots will just get dropped.
With these changes I get:
['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']