TLDR: I'm trying to capture everything outside of quotation marks, but I can't get it to work in Java with this regex \"|"(?:\"|[^"])*"|([^\"]+) even though it works on websites such as http://myregexp.com/. Can anyone point out what I'm doing wrong?
Hi, I'm currently trying to analyse a .java source file and extract, as strings, everything outside quotation marks (ignoring escaped quotes).
For example, in this string:
This should be captured "not this" and "not \"this\" either".
Using Pattern and Matcher, I should be able to find "This should be captured", "and", ".".
What I currently have is \"[^\"]+\"|([^\"]+), which works well as long as the double quotes in the document are balanced, but breaks as soon as there is an escaped one.
On an online regex tester, I tried \"|"(?:\"|[^"])*"|([^\"]+) which seems to do exactly what I'm looking for, but when I try it in Java it doesn't.
For your current task, you can split the string with a pattern that matches double-quoted string literals:
String[] res = s.split("\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*");
See the Java demo:
String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
String[] res = s.split("\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*");
System.out.println(Arrays.toString(res));
// => [This should be captured, and, .]
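One caveat with the split approach: String.split keeps an empty leading element when the input starts with a match, so an input that begins with a quoted literal yields "" as its first item. The sketch below filters those out. The class name SplitCaveat and the helper splitOutsideQuotes are hypothetical names used only for illustration, and the sample input is made up.

```java
import java.util.Arrays;

class SplitCaveat {
    // Same pattern as in the demo above: a double-quoted literal
    // (with backslash escapes) plus any surrounding whitespace.
    static final String QUOTED = "\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*";

    // Hypothetical helper: split on quoted literals and drop empty pieces.
    static String[] splitOutsideQuotes(String s) {
        return Arrays.stream(s.split(QUOTED))
                     .filter(part -> !part.isEmpty())
                     .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String s = "\"quoted first\" then text";
        // Plain split keeps an empty leading element here.
        System.out.println(Arrays.toString(s.split(QUOTED)));
        // => [, then text]
        System.out.println(Arrays.toString(splitOutsideQuotes(s)));
        // => [then text]
    }
}
```

Trailing empty strings are already removed by split itself, so only the leading one needs handling.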
Pattern details:
\\s* - 0+ whitespaces
\" - a double quote
[^\"\\\\]* - 0+ chars other than " and \
(?:\\\\.[^\"\\\\]*)* - 0+ sequences of:
    \\\\. - a \ and any char other than line break chars
    [^\"\\\\]* - 0+ chars other than " and \
\"\\s* - a " and 0+ whitespaces
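If you would rather keep the Pattern and Matcher approach from the question, the same quoted-literal subpattern can be used as the first alternative, with a capturing group as the second. This is only a sketch under a few assumptions: the class name OutsideQuotes and the method outsideQuotes are hypothetical, each captured chunk is trimmed to mirror the \\s* in the split pattern, and a stray unbalanced quote is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutsideQuotes {
    // Alternative 1 consumes a whole double-quoted literal, escapes included.
    // Alternative 2 captures a run of text that contains no double quote.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|([^\"]+)");

    // Hypothetical helper: collect the trimmed text found outside quotes.
    static List<String> outsideQuotes(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            // Group 1 is null whenever the quoted-literal alternative matched.
            if (m.group(1) != null) {
                String text = m.group(1).trim();
                if (!text.isEmpty()) {
                    parts.add(text);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
        System.out.println(outsideQuotes(s));
        // => [This should be captured, and, .]
    }
}
```

Note that inside a Java string literal every regex backslash has to be doubled, so a regex that matches a literal backslash is written as "\\\\". A pattern copied unchanged from an online tester will usually mean something different once it is placed in Java source.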