Parsing a CSV: Edge Cases

Disclaimer: I'm not actually parsing a CSV, but a CSV-like format; I'm not interested in using a pre-built library.

What is the proper way to parse the following two lines?

a,b"c"d,e
a,"bc"d,e

That is, (a) has quotes in the middle of a value, and (b) has a quote at the start of a value, but the closing quote does not come immediately before the next comma.

I can't figure out which way of handling these cases would be the most intuitive.

My thoughts are that (a) should be parsed as a,b"c"d,e (quotes left in), and (b) should be parsed the same way, a,"bc"d,e. But then let me introduce a 3rd case, a,"b,c"d,e -- do we split on that 2nd comma between "b" and "c" or not?

Solution

Here is how you would parse it if you want to be consistent with Excel:

input:

a,b"c"d,e
a,"bc"d,e
a,"b,c"d,e

parsed (in JSON):

[
  ["a", "b\"c\"d", "e"],
  ["a", "bcd", "e"],
  ["a", "b,cd", "e"]
]

The parsing logic is:

If the first character of the row, or the character immediately after a comma, is a double quote, then:
- Treat everything following it as quoted until you reach the closing quote (two quotes in a row count as one escaped double quote). Remove the opening and closing quotes from the cell value.
- After the closing quote has been reached, include anything that follows it in the cell value as literal text, until a comma or newline is reached.

Otherwise, treat everything from the start of the cell as a literal value, until a comma or newline is reached.

Note that this means a space between a cell-delimiting comma and a double quote changes the result: the quote is then no longer the first character of the cell, so the whole cell is treated as a literal value and the quotes stay in.
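
The rules above can be sketched as a small hand-written parser. This is only an illustration under a few assumptions, not a statement of what Excel does internally: the function name parse_row and its delim parameter are made up for this example, it handles one row at a time (so a quoted cell cannot span lines), an unterminated quote simply swallows the rest of the row, and any quote that appears after the closing quote is kept as a literal character.

```python
def parse_row(line: str, delim: str = ",") -> list[str]:
    """Split one row into cells using the quote rules described above.

    Hypothetical helper for illustration; assumes `line` has no trailing newline.
    """
    cells = []
    i, n = 0, len(line)
    while True:
        buf = []
        # A quote only opens a quoted section when it is the first char of the cell.
        if i < n and line[i] == '"':
            i += 1  # drop the opening quote
            while i < n:
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        buf.append('"')  # doubled quote -> one literal quote
                        i += 2
                        continue
                    i += 1  # drop the closing quote
                    break
                buf.append(line[i])  # delimiters inside quotes are kept
                i += 1
        # Literal part: the whole cell if it was unquoted, or whatever
        # trails the closing quote, up to the next delimiter.
        while i < n and line[i] != delim:
            buf.append(line[i])
            i += 1
        cells.append("".join(buf))
        if i >= n:
            return cells
        i += 1  # skip the delimiter and start the next cell


for row in ['a,b"c"d,e', 'a,"bc"d,e', 'a,"b,c"d,e', 'a, "b,c"d,e']:
    print(parse_row(row))
```

The first three rows reproduce the parsed output shown earlier. The fourth row has a space after the first comma, so the quote is not at the start of the cell and the row splits into four literal cells instead of three.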