I'm trying to parse a single line of a CSV file. Currently this is done with an online regex tester, but in the end it has to be implemented in C# (added in response to a question in the comments).
I have read a lot of other posts here on SO to figure it out by myself, but I'm stuck.
My test line for the regex looks like this (UPDATE: quotes are now escaped inside quoted strings):
;;"test123;weiterer Text";;"Test mit " Zeichen im Spaltenwert";nächste Spalte mit " Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo"test"
;;"test123;weiterer Text";;"Test mit "" Zeichen im Spaltenwert";nächste Spalte mit "" Begrenzungszeichen;"4711";irgendwas 123,4;1222;"foo""test"
Problem:
What I have come up with so far, after some googling and with my limited understanding of regular expressions, is this expression:
(?<=^|;)(\".\"|[^;]*)|[^;]+
This gives the following result:
[0] =>
[1] =>
[2] => "test123
[3] => weiterer Text"
[4] =>
[5] => "Test mit " Zeichen im Spaltenwert"
[6] => nächste Spalte mit " Begrenzungszeichen
[7] => "4711"
[8] => irgendwas 123,4
[9] => 1222
[10] => "foo"test"
Tested with https://www.myregextester.com/
The problem I have now is with elements 2 and 3. This text
"test123;weiterer Text"
has to be one column, but it gets split at the semicolon inside the quoted string, although I thought I had told the expression to match everything inside quotation marks.
Any help here is highly appreciated. Thanks in advance.
Assuming a proper CSV that uses doubled quotes for escaping ("") and that is read line by line, you can use
"(?:[^"]+|"")*"|[^;]+|(?<=;|^)(?=;|$)
Basically, there are three different ways to match a column:
"(?:[^"]+|"")*"
an opening and a closing quote with non-quotes or doubled quotes in between
[^;]+
a series of non-semicolons
(?<=;|^)(?=;|$)
an empty field between semicolons, or between a semicolon and the start/end of the line
Note: \n in the negated character classes.
See https://regex101.com/r/twKZVN/1
(While regex101 tests a PCRE pattern, all features used are also available in a .NET pattern.)
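Since the question asks for C#, here is a minimal sketch of how the pattern above might be used there. The names CsvLineSplitter, Split and Unquote are made up for this illustration, and the sample line is a shortened variant of the test line from the question. Stripping the surrounding quotes and collapsing "" to " afterwards is an extra step that I am assuming is wanted; it is not something the regex does by itself.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CsvLineSplitter
{
    // The pattern from above as a verbatim string; inside @"..." every
    // literal double quote has to be written as "".
    private static readonly Regex FieldPattern =
        new Regex(@"""(?:[^""]+|"""")*""|[^;]+|(?<=;|^)(?=;|$)");

    // Splits ONE line (no embedded line breaks) into its columns.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(line))
        {
            fields.Add(Unquote(m.Value));
        }
        return fields;
    }

    // Assumption: quoted fields should lose their surrounding quotes and
    // have doubled quotes collapsed. Unquoted fields are returned unchanged.
    private static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            return field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
        }
        return field;
    }
}

static class Program
{
    static void Main()
    {
        // Shortened sample: ;;"test123;weiterer Text";;"foo""test";1222
        string line = @";;""test123;weiterer Text"";;""foo""""test"";1222";

        List<string> fields = CsvLineSplitter.Split(line);
        for (int i = 0; i < fields.Count; i++)
        {
            Console.WriteLine("[" + i + "] => " + fields[i]);
        }
    }
}
```

With this sample, "test123;weiterer Text" should come out as a single column and the empty columns should be kept. Note that an unquoted field containing "" (like the "nächste Spalte ..." column in the question) is left untouched by Unquote, since the doubling only has a defined meaning inside a quoted field.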