c# regex csv double-quotes

Match CSV line with semicolons and quotation inside a quoted string


I'm trying to parse a single line of a CSV file. Currently this is done with an online regex tester, but in the end it has to be implemented in C# (added in response to a question in the comments).

I read a lot of other articles here on SO to figure it out by myself, but I'm stuck.

My test line for the regex looks like this (UPDATE: the second version has the quotes inside of quoted strings escaped by doubling them):

;;"test123;weiterer Text";;"Test mit " Zeichen im Spaltenwert";nächste Spalte mit " Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo"test"

;;"test123;weiterer Text";;"Test mit "" Zeichen im Spaltenwert";nächste Spalte mit "" Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo""test"
  • ; is the delimiter
  • " is the sign for quoted columns

Problem:

  • the line may contain empty columns (semicolon followed by semicolon without any text)
  • quoted strings may contain the quote sign, like here "Test mit " Zeichen im Spaltenwert"
  • the column delimiter may also occur in quoted strings, like here: "test123;weiterer Text"

What I have come up with so far, after some googling and with my limited understanding of regular expressions, is this expression:

(?<=^|;)(\".\"|[^;]*)|[^;]+

This gives the following result:

        [0] => 
        [1] => 
        [2] => "test123
        [3] => weiterer Text"
        [4] => 
        [5] => "Test mit " Zeichen im Spaltenwert"
        [6] => nächste Spalte mit " Begrenzungszeichen
        [7] => "4711"
        [8] => irgendwas 123,4
        [9] => 1222
        [10] => "foo"test"

Tested with https://www.myregextester.com/

The problem I have now is with elements 2 and 3. This text

"test123;weiterer Text"

has to be one column but gets split at the semicolon inside the quoted string, although I thought I had told the expression to match everything inside quotation marks.

Any help here is highly appreciated. Thanks in advance.


Solution

  • Assuming a proper CSV that uses doubled quotes for escaping ("") and that is read line by line, you can use

    "(?:[^"]+|"")*"|[^;]+|(?<=;|^)(?=;|$)
    

    Basically three different ways to match a column:

    • "(?:[^"]+|"")*" starting and closing quote with non-quotes or double quotes between
    • [^;]+ a series of non-semicolons
    • (?<=;|^)(?=;|$) an empty field between semicolons or between a semicolon and the start/end
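Since the final target is C#, the three alternatives above could be applied roughly as follows. This is only a minimal sketch: the helper name SplitCsvLine is made up for illustration, the code is written with top-level statements (which assumes C# 9 or later), and it returns the raw matches, so quoted columns keep their surrounding quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the original answer): returns the raw
// matches, so quoted columns still include their surrounding quotes.
List<string> SplitCsvLine(string line)
{
    // Same three alternatives: quoted column | unquoted column | empty column.
    const string pattern = "\"(?:[^\"]+|\"\")*\"|[^;]+|(?<=;|^)(?=;|$)";
    var fields = new List<string>();
    foreach (Match m in Regex.Matches(line, pattern))
        fields.Add(m.Value);
    return fields;
}

// Test line from the question (quotes inside columns are doubled).
string testLine = ";;\"test123;weiterer Text\";;\"Test mit \"\" Zeichen im Spaltenwert\";"
                + "nächste Spalte mit \"\" Begrenzungszeichen;\"4711\";irgendwas 123,4;1222;\"foo\"\"test\"";

List<string> columns = SplitCsvLine(testLine);
for (int i = 0; i < columns.Count; i++)
    Console.WriteLine($"[{i}] => {columns[i]}");
```

For the test line this should give ten columns, with "test123;weiterer Text" kept in one piece as column 2.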

    Note:

    • if you want to use this in a multiline context, you would have to add \n to the negated character classes
    • it doesn't handle leading or trailing spaces connected with quoted fields
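The first note could look like this in C#. It is a sketch under several assumptions: the helper name SplitCsvText is invented, line endings are plain \n (with \r\n the pattern would need further adjustment), quoted columns do not contain line breaks, and RegexOptions.Multiline is used so that ^ and $ also match at line boundaries. The result is one flat list of columns without any information about which line each came from.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: matches the columns of a text that spans several lines.
List<string> SplitCsvText(string text)
{
    // \n is added to both negated classes so that no column runs across a line break.
    const string pattern = "\"(?:[^\"\\n]+|\"\")*\"|[^;\\n]+|(?<=;|^)(?=;|$)";
    var fields = new List<string>();
    // Multiline: ^ and $ match at the start and end of every line.
    foreach (Match m in Regex.Matches(text, pattern, RegexOptions.Multiline))
        fields.Add(m.Value);
    return fields;
}

// Two short lines, assumed to be separated by a single \n.
foreach (string column in SplitCsvText("a;;b\n;c"))
    Console.WriteLine("[" + column + "]");
```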

    See https://regex101.com/r/twKZVN/1

    (While regex101 tests a PCRE pattern, all features used are also available in a .NET pattern.)
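The pattern matches quoted columns including their quotes, so a follow-up step is usually needed to get the actual values. The following is a possible post-processing sketch and not part of the original answer: the helper name Unquote is made up, and it only touches columns that are wrapped in quotes. Whether doubled quotes in unquoted columns should be collapsed too depends on how the file was produced, so they are left alone here.

```csharp
using System;

// Hypothetical helper: strips the surrounding quotes of a quoted column
// and turns each doubled quote ("") back into a single one.
string Unquote(string field)
{
    bool isQuoted = field.Length >= 2
                    && field[0] == '"'
                    && field[field.Length - 1] == '"';
    if (!isQuoted)
        return field; // unquoted columns are returned unchanged

    return field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
}

Console.WriteLine(Unquote("\"foo\"\"test\""));             // quoted column with a doubled quote
Console.WriteLine(Unquote("\"test123;weiterer Text\""));   // quoted column with a delimiter
Console.WriteLine(Unquote("1222"));                        // unquoted column
```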