I have a CSV column containing text. Each row is delimited by double quotes "
Sample text in a row looks like this (note: the newlines and the spaces before each line are intentional)
"Lorem ipsum dolor sit amet,
consectetur adipisicing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna
aliqua. Ut ""enim ad"" minim veniam,
quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo
consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse
cillum dolore eu fugiat ""nulla pariatu"""
"ex ea commodo
consequat. Duis aute irure ""dolor in"" reprehenderit
in voluptate velit esse
cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non
proident, sunt in culpa qui officia deserunt
mollit anim id est laborum."
The above represents two consecutive rows.
I want to select, as separate groups, all the text between each opening double quote "
(starting a line) and each closing (LAST) double quote "
As you can see, though, there are line breaks in the text, along with escaped double quotes ""
which are part of the text that I need to select.
I came up with something like this
(?s)(?!")[^\s](.+?)(?=")
but the multiple double quotes are breaking my desired match
I'm a real novice with regex, so I think maybe I'm missing something very basic. Dunno if relevant but I'm using Sublime Text 3 so should be python I think.
What can I do to achieve what I need?
You can use the following regex:
"[^"]*(?:""[^"]*)*"
See demo
This regex matches a double-quoted string whose contents are any mix of non-quote characters and pairs of consecutive double quotes.
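To see the pattern at work outside the editor, here is a minimal Python sketch. The sample string is a shortened stand-in for the two rows in the question, and the final unescaping step (stripping the outer quotes and collapsing "" to ") is my assumption about what you want to do with the matches, not something the regex does by itself.

```python
import re

# Pattern from the answer: opening quote, runs of non-quote characters
# separated by escaped "" pairs, then the closing quote.
FIELD = re.compile(r'"[^"]*(?:""[^"]*)*"')

# Shortened, hypothetical version of the two rows from the question.
sample = (
    '"Lorem ipsum dolor sit amet,\n'
    '  aliqua. Ut ""enim ad"" minim veniam,\n'
    '  cillum dolore eu fugiat ""nulla pariatu"""\n'
    '"ex ea commodo\n'
    '  consequat. Duis aute irure ""dolor in"" reprehenderit\n'
    '  mollit anim id est laborum."\n'
)

# No capturing groups, so findall returns each whole match (quotes included).
rows = FIELD.findall(sample)

# Assumed post-processing: drop the outer quotes and unescape "" to ".
texts = [row[1:-1].replace('""', '"') for row in rows]

print(len(rows))
print(texts[1])
```

Each element of rows is one complete row, newlines and escaped quotes included.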
How does it work? Let me share a graphic from debuggex.com:
With the regex, we match:
- " - (1) - a literal quote
- [^"]* - (2, 3) - 0 or more characters other than a quote (yes, including a newline, since this is a negated character class); if there are none, the regex moves on towards the final literal quote (6)
- (?:""[^"]*)* - (4, 5) - 0 or more sequences of:
  - "" - (4) - two consecutive double quotation marks
  - [^"]* - (5) - 0 or more characters other than a quote
- " - (6) - the final literal quote.
This works faster than "(?:[^"]|"")*"
(although yielding the same results), because the former is processed linearly, with much less backtracking.
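As a quick sanity check of the "same results" part, the two patterns can be compared side by side in Python. This is only a sketch on a small made-up input; it confirms that the matches agree here and says nothing about timing.

```python
import re

# Unrolled form from the answer.
unrolled = re.compile(r'"[^"]*(?:""[^"]*)*"')
# Alternation form it is being compared against.
alternation = re.compile(r'"(?:[^"]|"")*"')

# Hypothetical two-row input with newlines and escaped quotes.
sample = '"a ""b"" c\n  d"\n"e ""f"""\n'

unrolled_matches = unrolled.findall(sample)
alternation_matches = alternation.findall(sample)

# Both patterns should pick out the same two quoted rows.
same_matches = unrolled_matches == alternation_matches
print(same_matches, len(unrolled_matches))
```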