python csv pyparsing

How to parse a CSV with commas between parentheses and missing values


I tried using pyparsing to parse a CSV with:

  • Commas between parentheses (or brackets, etc.): "a(1,2),b" should return the list ["a(1,2)","b"]
  • Missing values: "a,b,,c," should return the list ['a','b','','c','']

I worked out a solution, but it seems "dirty". Mainly, the Optional sits inside only one of the possible atomics. I think the Optional should be independent of the atomics; that is, I feel it belongs somewhere else, for example in the optional arguments of delimitedList. In my trial and error, though, that was the only place that worked and made sense. It could go in any of the possible atomics, so I chose the first.

Also, I don't fully understand what originalTextFor is doing, but if I remove it, the code stops working.

Working example:

import pyparsing as pp

# Function that parses a line of columns separated by commas and returns a list of the columns
def fromLineToRow(line):
    sqbrackets_col = pp.Word(pp.printables, excludeChars="[],") | pp.nestedExpr(opener="[",closer="]")  # matches "a[1,2]"
    parens_col = pp.Word(pp.printables, excludeChars="(),") | pp.nestedExpr(opener="(",closer=")")      # matches "a(1,2)"
    # In the following line:
    # * The "^" means "choose the longest option"
    # * The "pp.Optional" can be in any of the expressions separated by "^". I put it only on the first. It's used for when there are missing values
    atomic = pp.originalTextFor(pp.Optional(pp.OneOrMore(parens_col))) ^ pp.originalTextFor(pp.OneOrMore(sqbrackets_col))

    grammar = pp.delimitedList(atomic)

    row = grammar.parseString(line).asList()
    return row

file_str = \
"""YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""

for line in file_str.splitlines():
    row = fromLineToRow(line)
    print(row)

Prints:

['YEAR', 'a(2,3)', 'b[3,4]']
['1960', '2.8', '3']
['1961', '4', '']
['1962', '', '1']
['1963', '1.27', '3']

Is this the right way to do this? Is there a "cleaner" way to use the Optional inside the first atomic?


Solution

  • Working inside-out, I get this:

    # chars not in ()'s or []'s - also disallow ','
    non_grouped = pp.Word(pp.printables, excludeChars="[](),")
    
    # grouped expressions in ()'s or []'s
    grouped = pp.nestedExpr(opener="[",closer="]") | pp.nestedExpr(opener="(",closer=")")
    
    # use OneOrMore to allow non_grouped and grouped together
    atomic = pp.originalTextFor(pp.OneOrMore(non_grouped | grouped))
    # or based on your examples, you *could* tighten this up to:
    # atomic = pp.originalTextFor(non_grouped + pp.Optional(grouped))
    

    originalTextFor recombines the original input text within the leading and trailing boundaries of the matched expressions, and returns a single string. If you leave this out, then you will get all the sub-expressions in a nested list of strings, like ['a',['2,3']]. You could rejoin them with repeated calls to ''.join, but that would collapse out whitespace (or use ' '.join, but that has the opposite problem of potentially introducing whitespace).

    To make the elements of the list optional, just say so in the definition of the delimited list:

    grammar = pp.delimitedList(pp.Optional(atomic, default=''))
    

    Be sure to add the default value, else the empty slots will just get dropped.

    With these changes I get:

    ['YEAR', 'a(2,3)', 'b[3,4]']
    ['1960', '2.8', '3']
    ['1961', '4', '']
    ['1962', '', '1']
    ['1963', '1.27', '3']
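
For reference, the pieces above can be assembled into one script. This is only a sketch: the names `column`, `nested`, `flat`, `dropped` and `from_line_to_row` are made up for illustration, and it keeps the camelCase pyparsing names used in the question (recent pyparsing 3 releases also provide snake_case equivalents such as `original_text_for`, `nested_expr` and `parse_string`). It also shows what happens when originalTextFor or the default value is left out.

```python
import pyparsing as pp

# Building blocks from the answer.
non_grouped = pp.Word(pp.printables, excludeChars="[](),")
grouped = pp.nestedExpr(opener="[", closer="]") | pp.nestedExpr(opener="(", closer=")")

# "column" is a hypothetical name for the expression before it is wrapped.
column = pp.OneOrMore(non_grouped | grouped)
atomic = pp.originalTextFor(column)

# Without originalTextFor, the sub-expressions come back as a nested list.
nested = column.parseString("a(2,3)").asList()   # ['a', ['2,3']]

# With originalTextFor, the matched span comes back as a single string.
flat = atomic.parseString("a(2,3)").asList()     # ['a(2,3)']

# Without a default, Optional adds no token for an empty slot,
# so the missing value silently disappears from the row.
dropped = pp.delimitedList(pp.Optional(atomic)).parseString("1962,,1").asList()
# ['1962', '1']

# With default='', each empty slot yields an empty string.
grammar = pp.delimitedList(pp.Optional(atomic, default=""))


def from_line_to_row(line):
    """Parse one comma-separated line into a list of column strings."""
    return grammar.parseString(line).asList()


file_str = """YEAR,a(2,3),b[3,4]
1960,2.8,3
1961,4,
1962,,1
1963,1.27,3"""

rows = [from_line_to_row(line) for line in file_str.splitlines()]
for row in rows:
    print(row)
```

Building the grammar once at module level, rather than inside the function as in the question, avoids reconstructing the parser for every line.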