java regex tokenize

Matching exactly n occurrences of a character in a regex


I am trying to split a string into tokens on the operators + = == <= >= != || { } when they occur outside double quotes. However, my regex also splits on a single occurrence of |, <, > or !, which I do not want. How can I handle this?

String line1= "sa2dvf=s||a|df&&v<gdsf==ds!gv!=fdgv\"fvdsvg=kjhbhbj==\"";
String regex = "[\\{\\}+={!=}{<=}{>=}{||}](?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";
String[] tokens = line1.split(regex, -1);
for(String val : tokens) {
    System.out.println(val);
}

Its output is:

sa2dvf
s

a
df&&v
gdsf

ds
gv

fdgv"fvdsvg=kjhbhbj=="

But the required output is:

sa2dvf
s
a|df&&v<gdsf
ds!gv
fdgv"fvdsvg=kjhbhbj=="

Solution

  • You can use this lookahead regex for splitting:

    String[] arr = str.split("(?:[<>=!]=|\\|\\||[+=\\{}])(?=(?:(?:[^\"]*\"){2})*[^\"]*$)");
    


    RegEx Breakup:

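    • (?:[<>=!]=|\\|\\||[+=\\{}]): matches one of the operators we want to split on: a two-character operator ending in = (<=, >=, ==, !=), or ||, or a single +, =, { or }. The two-character alternatives come first, so == is consumed as one operator rather than as two separate = characters. (The backslashes are doubled here because this is the Java string form.)
    • (?:[^"]*"){2}: matches text up to and including a pair of quotes
    • (?:(?:[^"]*"){2})*: matches 0 or more pairs of quotes
    • [^"]*$: makes sure there are no more quotes after the last matched pair

    So (?=...) asserts that an even number of quotes lies ahead, which means the operator is matched only when it is outside a quoted string.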
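As a rough sketch of the approach above, the following small program applies the split to the sample string from the question. The class name QuoteAwareSplit and the helper splitOutsideQuotes are illustrative names, not part of the original answer, and the sketch assumes that the quotes in the input are balanced and never escaped.

```java
import java.util.regex.Pattern;

class QuoteAwareSplit {
    // Operators to split on: two-character forms first, then single characters.
    private static final String OPERATOR = "(?:[<>=!]=|\\|\\||[+=\\{}])";

    // Lookahead: an even number of double quotes remains before the end of the
    // input, so the current position is outside any quoted string.
    // Assumption: quotes are balanced and not escaped.
    private static final String OUTSIDE_QUOTES = "(?=(?:(?:[^\"]*\"){2})*[^\"]*$)";

    private static final Pattern SPLITTER = Pattern.compile(OPERATOR + OUTSIDE_QUOTES);

    // Hypothetical helper: splits the input on operators found outside quotes.
    static String[] splitOutsideQuotes(String input) {
        return SPLITTER.split(input);
    }

    public static void main(String[] args) {
        String line1 = "sa2dvf=s||a|df&&v<gdsf==ds!gv!=fdgv\"fvdsvg=kjhbhbj==\"";
        for (String token : splitOutsideQuotes(line1)) {
            System.out.println(token);
        }
    }
}
```

Run on the question's input, this should print the five tokens listed as the required output: the single |, < and ! stay inside their tokens, and the = and == inside the quoted part are left alone. Compiling the pattern once is only a convenience for repeated use; calling str.split with the same regex string behaves the same way.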