Java Regex - Capturing everything outside quotes


TLDR: I want to capture everything outside of quotation marks, but this regex fails to do so in Java: \"|"(?:\"|[^"])*"|([^\"]+), even though it works on websites such as http://myregexp.com/. Can anyone point out what I'm doing wrong?

Hi, I'm currently trying to analyse .java source code and extract, as strings, everything outside quotation marks (ignoring escaped quotes).

For example, in this string:

This should be captured "not this" and "not \"this\" either".

Using Pattern and Matcher, I should be able to find "This should be captured", "and", and ".".

What I currently have is \"[^\"]+\"|([^\"]+), which works well as long as the quotes in the document come in balanced pairs, but breaks as soon as there is an escaped one.

On an online regex tester, I tried \"|"(?:\"|[^"])*"|([^\"]+), which seems to do exactly what I'm looking for, but it does not behave the same way when I try it in Java.


Solution

  • For your current task, you can split the string with a pattern that matches double-quoted string literals (including escaped characters inside them):

    String[] res = s.split("\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*");
    

    See the Java demo:

    // Requires: import java.util.Arrays;
    String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
    String[] res = s.split("\\s*\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"\\s*");
    System.out.println(Arrays.toString(res));
    // => [This should be captured, and, .]
    

    Pattern details (shown as written in the Java string literal):

    • \\s* - zero or more whitespace characters
    • \" - a double quote
    • [^\"\\\\]* - zero or more characters other than " and \
    • (?:\\\\.[^\"\\\\]*)* - zero or more sequences of:
      • \\\\. - a \ followed by any character except a line terminator
      • [^\"\\\\]* - zero or more characters other than " and \
    • \"\\s* - a " and zero or more whitespace characters
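
If you would rather stay with Pattern and Matcher, as in the question, the same quoted-literal pattern can be used as the first alternative, with a capturing group for everything else. The following is only a minimal sketch: the class name OutsideQuotes and the method extract are hypothetical names chosen for illustration, and it assumes that quotes only ever delimit string literals (char literals such as '"' and comments are not handled).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class OutsideQuotes {
    // Alternative 1: a complete double-quoted literal (escape-aware); matched, then discarded.
    // Alternative 2: a run of non-quote characters, captured in group 1.
    private static final Pattern TOKEN = Pattern.compile(
            "\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|([^\"]+)");

    static List<String> extract(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            String outside = m.group(1); // null when the quoted alternative matched
            if (outside != null) {
                String trimmed = outside.trim();
                if (!trimmed.isEmpty()) {
                    parts.add(trimmed); // keep only non-blank text outside quotes
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String s = "This should be captured \"not this\" and \"not \\\"this\\\" either\".";
        System.out.println(extract(s));
        // => [This should be captured, and, .]
    }
}
```

One difference from the split approach: when the input starts with a quoted literal, split returns a leading empty string, whereas this loop simply skips the literal and adds nothing for it.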