Search code examples
regexcsvtextrfc4180

Need regex help to clean up delimited data


I've got some pipe-delimited data that looks like this:

    field|field|field|field|another "field"|field
    field|field|field|"another" field|field|field
    field|"fie|ld"|field|field|field|field

And the problem that I'm having is that the double-quotes aren't properly formed for a delimited field (fields containing the delimiter, or the quotes, should itself be wrapped with double-quotes, and the double-quotes should be excaped with another double-quote). This is what the output should be:

    field|field|field|field|"another ""field"""|field
    field|field|field|"""another"" field"|field|field
    field|"fie|ld"|field|field|field|field

Anyone know if there is an easy way to do this with a regular expression?


Solution

  • Only some of this operation is suitable for regex. Other parts are better suited for regular program flow. Regex can still be an integral part of the operation, but use built-in string manipulation where possible.

    Using PCRE:

    1. Read in one line of the input string.
    2. Split the line at any match to "[^"]*"(*SKIP)(*F)|\| to get a list of fields.
    3. For each field:
      1. If it contains a match to (?<!^)"(?!$), surround the field in quotes.
      2. Afterwards, replace every match to (?<!^)"(?!$) with "".
    4. Rejoin the list of fields and output it as a line of the new CSV.
    5. Go back to step #1 until all lines have been processed.

    Regex #1 Test
    Regex #2 Test (Stage 1)
    Regex #2 Test (Stage 2)

    My attempt at translating this to PERL