python, regex, csv, sublimetext3, double-quotes

Regex for CSV split including multiple double quotes


I have a CSV column containing text. Each row is enclosed in double quotes (").

Sample text in a row looks like this (note: the line breaks and the leading space on each line are intentional):

"Lorem ipsum dolor sit amet, 
 consectetur adipisicing elit, sed do eiusmod
 tempor incididunt ut labore et dolore magna 
 aliqua. Ut ""enim ad"" minim veniam,
 quis nostrud exercitation ullamco laboris nisi 
 ut aliquip ex ea commodo
 consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
 cillum dolore eu fugiat ""nulla pariatu"""
"ex ea commodo
 consequat. Duis aute irure ""dolor in"" reprehenderit 
 in voluptate velit esse
 cillum dolore eu fugiat nulla pariatur. 
 Excepteur sint occaecat cupidatat non
 proident, sunt in culpa qui officia deserunt 
 mollit anim id est laborum."

The above represents two consecutive rows.

I want to select, as separate groups, all the text between each opening double quote " (at the start of a line) and the matching closing double quote ".

As you can see, though, the text contains line breaks as well as escaped double quotes "", which are part of the text I need to select.

I came up with something like this

(?s)(?!")[^\s](.+?)(?=")

but the multiple double quotes are breaking my desired match

I'm a real novice with regex, so I may be missing something very basic. I don't know if it's relevant, but I'm using Sublime Text 3, so I think the regex flavor should be Python's.

What can I do to achieve what I need?


Solution

  • You can use the following regex:

    "[^"]*(?:""[^"]*)*"
    


    This regex matches a double-quoted string whose content is made up of non-quote characters and pairs of consecutive double quotes.

    How does it work? The numbers in parentheses below refer to the nodes of a diagram of this regex generated with debuggex.com.

    With the regex, we match:

    • " - (1) - a literal quote
    • [^"]* - (2, 3) - 0 or more characters other than a quote (this includes newlines, because it is a negated character class); if there are none, the regex goes on to look for the final literal quote (6)
    • (?:""[^"]*)* - (4,5) - 0 or more sequences of:
      • "" - (4) - two consecutive double quotation marks
      • [^"]* - (5) - 0 or more characters other than a quote
    • " - (6) - the final literal quote.
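
The breakdown above can be tried out in Python's `re` module. This is only a minimal sketch: the `sample` string is a shortened stand-in for the two rows in the question, and the names `FIELD`, `rows`, and `texts` are made up for illustration.

```python
import re

# Pattern from the answer: an opening quote, a run of non-quote characters,
# then any number of ("" followed by a run of non-quote characters),
# then the closing quote.
FIELD = re.compile(r'"[^"]*(?:""[^"]*)*"')

# Shortened stand-in for the two rows in the question (assumed sample data).
sample = (
    '"Lorem ipsum dolor sit amet,\n'
    ' aliqua. Ut ""enim ad"" minim veniam,\n'
    ' cillum dolore eu fugiat ""nulla pariatu"""\n'
    '"ex ea commodo\n'
    ' consequat. Duis aute irure ""dolor in"" reprehenderit\n'
    ' mollit anim id est laborum."\n'
)

# Each match is one whole row, including its enclosing quotes.
rows = FIELD.findall(sample)

# Optional clean-up: drop the enclosing quotes and turn each escaped ""
# back into a single ".
texts = [row[1:-1].replace('""', '"') for row in rows]

print(len(rows))
for text in texts:
    print(repr(text))
```

No `re.DOTALL` flag (or `(?s)`) is needed here, because `[^"]` already matches newlines.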

    This works faster than "(?:[^"]|"")*" (while yielding the same results) because the former is processed linearly and involves much less backtracking.
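
The "same results" part of this claim is easy to spot-check in Python. The sketch below compares the two patterns on a small test string invented for this purpose; it does not measure speed, which depends on the input and the regex engine (`timeit` could be used for that).

```python
import re

# The two patterns compared in the answer.
unrolled = re.compile(r'"[^"]*(?:""[^"]*)*"')
alternation = re.compile(r'"(?:[^"]|"")*"')

# Assumed test input: a field with escaped quotes, an empty field,
# a multi-line field, and a final unterminated quote.
sample = 'x "a ""b"" c" y "" z "line1\nline2" "unterminated'

unrolled_matches = unrolled.findall(sample)
alternation_matches = alternation.findall(sample)

# Both patterns should find exactly the same fields.
same = unrolled_matches == alternation_matches
print(same, unrolled_matches)
```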