c++ regex boost

Regex for a comma separated text with optional double quotes that can contain backslash-escaped quotes


I need a regular expression that can separate a string like:

1st, 2nd=second, "3rd=third","4th = forth",,"6th=\"this, is, the, sixth\""

into

1st         // not surrounded
2nd=second  // not surrounded
3rd=third   // surrounded
4th = forth // surrounded, keep the blank in the middle
            // empty string
6th="this, is, the, sixth"    // the escaped dbl-quotes and commas in the middle should be kept

Note that sections without commas or double quotes may or may not be surrounded by double quotes. Sections that contain such special characters must be surrounded, and any double quotes inside them must be escaped with a backslash. Empty values (like the 5th one) should also be kept.

Any help would be appreciated.


Solution

  • For the samples you provided, the following regex would suffice.

    (?|\h*"([^\\"]*(?:\\.[^\\"]*)*)"\h*|([^,]+|(?<=,)|^(?=,)))
    

    See this demo at regex101 (the \n in the demo is only there to show several inputs on separate lines).

    It uses a branch reset group, (?|...), so that every alternative captures the desired part into the same first group. Branch reset groups are supported by PCRE and also by Boost.Regex (added to its ECMAScript grammar in version 1.42).

    The pattern covers the following cases (alternatives are tried from left to right, so earlier ones take priority):

    1. \h*"([^\\"]*(?:\\.[^\\"]*)*)"\h* quoted parts: captures what is inside the quotes, including any number of backslash-escaped characters such as \". The quoted part may be surrounded by \h*, any amount of horizontal whitespace, which is not captured.
    2. [^,]+ parts without quotes: one or more characters that are not a comma.
    3. (?<=,) any remaining empty field that is preceded by a comma (lookbehind).
    4. ^(?=,) an empty field at the ^ start of the string/line, e.g. ,a
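If Boost is not available, note that std::regex supports none of branch reset groups, \h, or lookbehind, so the pattern above cannot be used there unchanged. The following is only a rough sketch of how the same alternation idea might be adapted: it uses two capture groups instead of a branch reset and consumes one field at a time, handling the commas in plain code. The function name split_with_std_regex is made up for this illustration, and well-formed input is assumed. As with the pattern above, backslashes in quoted parts and surrounding whitespace in unquoted parts are kept as they are.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for illustration, not part of the answer above.
// Assumes well-formed input: quoted fields are closed and followed by ',' or end.
std::vector<std::string> split_with_std_regex(const std::string& line) {
    // Group 1: content of a quoted field (escaped characters allowed).
    // Group 2: an unquoted field, which may be empty.
    static const std::regex field(
        R"re([ \t]*"([^\\"]*(?:\\.[^\\"]*)*)"[ \t]*|([^,]*))re");

    std::vector<std::string> fields;
    auto pos = line.cbegin();
    while (true) {
        std::smatch m;
        // match_continuous forces the match to start exactly at pos.
        if (!std::regex_search(pos, line.cend(), m, field,
                               std::regex_constants::match_continuous)) {
            break;
        }
        fields.push_back(m[1].matched ? m[1].str() : m[2].str());
        pos = m[0].second;
        // Stop at the end of the input or on anything that is not a separator.
        if (pos == line.cend() || *pos != ',') {
            break;
        }
        ++pos;  // skip the comma and continue with the next field
    }
    return fields;
}
```

Because the second alternative can match the empty string, empty fields such as the 5th one in the question come out as empty strings without needing a lookbehind.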

    Generally, it is recommended to use a CSV parser if one is available in your environment.
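If no such parser is at hand, a small hand-written splitter is another option for this particular format. The sketch below is a minimal illustration under stated assumptions, not a full CSV implementation: split_fields is a hypothetical name, input is a single line, quoted fields are assumed to be properly closed, and anything between a closing quote and the next comma is ignored. Unlike the regex, it also removes the escaping backslashes and trims whitespace around unquoted fields, which matches the output shown in the question.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration: splits one line into fields.
std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    const std::size_t n = line.size();
    std::size_t i = 0;

    auto is_blank = [](char c) { return c == ' ' || c == '\t'; };

    while (true) {
        // Skip horizontal whitespace in front of the field.
        while (i < n && is_blank(line[i])) ++i;

        std::string field;
        if (i < n && line[i] == '"') {
            ++i;  // opening quote
            while (i < n && line[i] != '"') {
                // A backslash escapes the next character: drop the backslash.
                if (line[i] == '\\' && i + 1 < n) ++i;
                field += line[i++];
            }
            if (i < n) ++i;  // closing quote
            // Ignore anything up to the next separator.
            while (i < n && line[i] != ',') ++i;
        } else {
            while (i < n && line[i] != ',') field += line[i++];
            // Trim trailing whitespace of an unquoted field.
            while (!field.empty() && is_blank(field.back())) field.pop_back();
        }
        fields.push_back(field);

        if (i >= n) break;  // end of input
        ++i;                // skip the comma
    }
    return fields;
}
```

Called with the sample line from the question, this yields six fields, with the fifth one empty and the sixth one containing unescaped double quotes.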