python csv regex-lookarounds

Python CSV reader: need to ignore quoted commas as delimiters


I need to parse a text file by splitting on commas, but not on commas inside quotes.
It looks like a trivial task, but I can't make Python do it right. The main problem is that an unquoted string precedes the quoted string, which probably means this is not well-formed CSV, but I need it exactly this way.

Example input:

cmd,print "AA"
cmd, print "AA,BB,CC"
cmd,   print " AA, BB, CC ", separate-window

Desired result (in Python syntax):

[['cmd', 'print "AA"'], 
 ['cmd', 'print "AA,BB,CC"'], 
 ['cmd', 'print " AA, BB, CC "', 'separate-window']]

Stripping the surrounding spaces is optional. Once I get a proper list I can strip() each item, so that's not a problem.

csv.reader splits on quoted commas too, so I get ['cmd', 'print "AA', 'BB', 'CC"'] instead.
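
The behaviour is easy to reproduce. Below is a minimal sketch, assuming the default csv dialect and using io.StringIO as a stand-in for the real file (the variable names are only illustrative):

```python
import csv
import io

# The second sample line from the question, fed through an in-memory file.
sample = 'cmd, print "AA,BB,CC"\n'

# With the default dialect, a quote only opens a quoted field when it is the
# first character of that field. Here it comes after ' print ', so it is
# treated as ordinary data and every comma still splits the line.
rows = list(csv.reader(io.StringIO(sample)))
print(rows)
```

The leading space in the second field is kept by csv.reader. The list shown above has it stripped.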

shlex with .whitespace=',' and .whitespace_split=True almost does the trick, but it removes the quotes: ['cmd', 'print AA, BB, CC']. I need to retain the quotes.

I thought about re.split, but I have a very weak understanding of how the (?=...) lookahead works...

I found a few similar topics here, but none of the proposed answers work for me.

UPDATE: a screenshot for anyone questioning whether I am doing exactly what I describe: screenshot


Solution

  • After googling a bit more and removing "python" from the query, I found the solution. A very similar question had been asked in a Java-related topic, and the answer was to use a regex.
    So I adapted it for Python, and here is the exact code that works for me:

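    import re

    # Split on a comma only when an even number of double quotes follows it,
    # i.e. when the comma is outside any quoted string.
    splitter = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')

    with open('example.txt') as csvfile:
        for padded_row in csvfile:
            stripped_row = padded_row.rstrip()  # drop the trailing newline
            row = splitter.split(stripped_row)
            print(row)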
    

    Detailed explanation of how it works
    Thanks to the commenters, you actually gave me some clues on how to improve my search queries :)
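
For readers who, like me, are not fluent in lookaheads: (?=...) checks that the text after the current position matches, without consuming it. So the pattern accepts a comma only if the rest of the line contains an even number of double quotes. The sketch below is my own reading of the pattern, not something taken from the linked explanation. It spells out the same regex with re.VERBOSE and runs it on the three sample lines from the question, which are hard-coded here instead of read from example.txt:

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so each part
# can carry a comment.
splitter = re.compile(r'''
    ,               # a comma ...
    (?=             # ... accepted only if the rest of the line looks like:
        (?:
            [^"]*"  #   any non-quote text, then an opening quote
            [^"]*"  #   any non-quote text, then the closing quote
        )*          #   zero or more complete quoted pairs
        [^"]*$      #   then no further quotes up to the end of the line
    )
''', re.VERBOSE)

# Sample lines copied from the question.
lines = [
    'cmd,print "AA"',
    'cmd, print "AA,BB,CC"',
    'cmd,   print " AA, BB, CC ", separate-window',
]

# Split each line, then strip the optional surrounding spaces.
rows = [[field.strip() for field in splitter.split(line)] for line in lines]
for row in rows:
    print(row)
```

This reproduces the desired result from the question. Two caveats, as far as I can tell: the pattern assumes the quotes on each line are balanced and never escaped, and every candidate comma rescans the rest of the line, so very long lines with many commas may be slow.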