c# regex split newline double-quotes

Regex split a string using newline (unless it is between double quotes)


I'm doing some delimited file handling. The first thing I need to do is get all the "lines"; after that, I can split each line on the specified delimiter. To get the lines, I need to split a string on any of the line terminators (\r\n, \r, \n). The following worked until I encountered a newline inside a double-quoted field:

return content.Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

Consider the following text (an earlier version of this question escaped double quotes inside double quotes as \" instead of ""), where each line ends with one of the line terminators and each field/column in the line is delimited by the pipe "|" character:

string s = "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3\nrow2 col1|\"row2 \"\"col2a\"\"\rrow2 \"\"col2b\"\"\"|row2 col3\r\nrow3 col1|\"row3 col2a\nrow3 col2b\"|row3 col3";

Which equals the following string:

row1 col1|"row1 ""col2a""{CRLF}row1 col2b"|row1 col3{LF}row2 col1|"row2 ""col2a""{CR}row2 ""col2b"""|row2 col3{CRLF}row3 col1|"row3 col2a{LF}row3 col2b"|row3 col3

Splitting the above with my original method results in 6 lines, because every line break counts as a delimiter, including the three inside quoted fields:

string[] result = s.Split(new string[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);

But I would like the split on a line delimiter (\r\n, \r, \n) to ignore line breaks inside double quotes and result in 3 lines:

result[0] == "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3"
result[1] == "row2 col1|\"row2 \"\"col2a\"\"\rrow2 \"\"col2b\"\"\"|row2 col3"
result[2] == "row3 col1|\"row3 col2a\nrow3 col2b\"|row3 col3"

Has anyone had some luck coming up with a regex to split on lines (except within quotes)?

Here is what I ended up with, thanks to Alan:

// Requires: using System.Text.RegularExpressions;
public string[] GetLines(string fileContent) {
    // Unquoted text may not contain \r or \n; quoted runs ("...") may.
    // The outer group repeats so that a line can hold several quoted fields.
    Regex regex = new Regex(@"^[^""\r\n]*(?:(?:""[^""]*"")+[^""\r\n]*)*", RegexOptions.Multiline);
    MatchCollection matchCollection = regex.Matches(fileContent);
    string[] result = new string[matchCollection.Count];
    for (int i = 0; i < matchCollection.Count; i++) {
        result[i] = matchCollection[i].Value;
    }
    return result;
}
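
One caveat worth checking before relying on this for all three terminators: as far as I know, ^ under RegexOptions.Multiline in .NET only recognizes \n as a line break, so a line that ends with a bare \r outside quotes would not start a new match. A regex-free scan that tracks whether it is inside quotes is one way around that. The following is only a sketch of that alternative technique, not the approach from the answer; the names QuoteAwareLineScanner and SplitRecords are made up for illustration, and it assumes quotes are balanced and escaped by doubling ("").

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareLineScanner
{
    // Hypothetical helper: splits on \r\n, \r, or \n unless the break is inside double quotes.
    public static List<string> SplitRecords(string content)
    {
        var records = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];

            if (c == '"')
            {
                // An escaped quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
                current.Append(c);
            }
            else if (!inQuotes && (c == '\r' || c == '\n'))
            {
                // Treat \r\n as a single terminator.
                if (c == '\r' && i + 1 < content.Length && content[i + 1] == '\n')
                {
                    i++;
                }
                records.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // Whatever follows the last terminator is the final record.
        records.Add(current.ToString());
        return records;
    }
}
```

Like Split with StringSplitOptions.None, this sketch returns a trailing empty entry when the content ends with a line terminator, so callers may want to drop it.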

Solution

  • I would use Matches() instead of Split():

    Regex r = new Regex(@"(?m)^[^""\r\n]*(?:(?:""[^""]*"")+[^""\r\n]*)*");
    MatchCollection m = r.Matches(s);
    

    The inner part, (?:"[^"]*")+, matches a double-quoted string that may contain escaped quotes: a field such as "a ""b"" c" is consumed as a run of adjacent quoted pieces. The whole regex matches a line that may contain any number of double-quoted strings. Note that the inner character classes ([^"]) can match \r and \n, whereas the outer ones ([^"\r\n]) explicitly exclude them. The line-start anchor (^ in multiline mode) prevents spurious empty matches between real matches.

    Here's a demo. (It's in PCRE, but I've tested it in .NET, too.)
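
To try the pattern against the sample string from the question, a small console program along these lines should do. This is a sketch: the class and method names (RecordSplitDemo, SplitRecords) are made up for illustration, and only the pattern itself comes from the answer above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RecordSplitDemo
{
    // Pattern from the answer: unquoted runs exclude \r and \n, quoted runs do not.
    private static readonly Regex RecordPattern =
        new Regex(@"(?m)^[^""\r\n]*(?:(?:""[^""]*"")+[^""\r\n]*)*");

    // Hypothetical helper: returns one string per matched line.
    public static string[] SplitRecords(string content)
    {
        return RecordPattern.Matches(content)
                            .Cast<Match>()
                            .Select(m => m.Value)
                            .ToArray();
    }

    public static void Main()
    {
        // The sample string from the question, broken up for readability.
        string s = "row1 col1|\"row1 \"\"col2a\"\"\r\nrow1 col2b\"|row1 col3\n"
                 + "row2 col1|\"row2 \"\"col2a\"\"\rrow2 \"\"col2b\"\"\"|row2 col3\r\n"
                 + "row3 col1|\"row3 col2a\nrow3 col2b\"|row3 col3";

        string[] records = SplitRecords(s);
        Console.WriteLine(records.Length); // prints 3

        foreach (string record in records)
        {
            // Show embedded line breaks as visible escapes.
            Console.WriteLine(record.Replace("\r", "\\r").Replace("\n", "\\n"));
        }
    }
}
```

Each match keeps its embedded line breaks and quotes intact, so the per-line split on the "|" delimiter can run on the returned strings afterwards.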