In Java, how do you tokenize a string that contains the delimiter in the tokens?


Let's say I have the string:

String toTokenize = "prop1=value1;prop2=String test='1234';int i=4;;prop3=value3";

I want the tokens:

  1. prop1=value1
  2. prop2=String test='1234';int i=4;
  3. prop3=value3

For backwards compatibility, I have to use the semicolon as a delimiter. I have tried wrapping code in something like CDATA:

String toTokenize = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";

But I can't figure out a regular expression that ignores the semicolons inside the CDATA section.

I've tried escaping the non-delimiter:

String toTokenize = "prop1=value1;prop2=String test='1234'\\;int i=4\\;;prop3=value3";

But then there is an ugly mess of removing the escape characters.

Do you have any suggestions?


Solution

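  • Match each value as one or more repetitions of either a whole <![CDATA[...]]> section or any single char other than ;. Listing the CDATA alternative first keeps the semicolons inside the section from ending the value. To match the keys, you may use a regular \w+ pattern: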

    (\w+)=((?:<!\[CDATA\[.*?]]>|[^;])+)
    

    Details

    • (\w+) - Group 1: one or more word chars
    • = - a = sign
    • ((?:<!\[CDATA\[.*?]]>|[^;])+) - Group 2: one or more repetitions of either
      • <!\[CDATA\[.*?]]> - a <![CDATA[...]]> substring (.*? is lazy, so it stops at the first ]]>)
      • | - or
      • [^;] - any char but ;

    See a Java demo:

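    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Group 1 captures the key, group 2 the value (CDATA sections are kept whole).
    String rx = "(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)";
    String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
    Pattern pattern = Pattern.compile(rx);
    Matcher matcher = pattern.matcher(s);
    // Each find() call advances to the next key=value pair.
    while (matcher.find()) {
        System.out.println(matcher.group(1) + " => " + matcher.group(2));
    }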
    

    Results:

    prop1 => value1
    prop2 => <![CDATA[String test='1234';int i=4;]]>
    prop3 => value3
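
    Note that the value of prop2 still carries the <![CDATA[ ... ]]> wrapper. If you need the three plain tokens listed in the question, one option is to strip the wrapper after matching. Below is a minimal sketch of that idea, not a definitive implementation. The class name CdataTokenizer and the method tokenize are made up for illustration, and Pattern.DOTALL is an added assumption so that a CDATA section may span several lines:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
public class CdataTokenizer {
    // Same key=value pattern as in the answer; DOTALL lets .*? cross line breaks.
    private static final Pattern PAIR = Pattern.compile(
            "(\\w+)=((?:<!\\[CDATA\\[.*?]]>|[^;])+)", Pattern.DOTALL);

    // Matches one CDATA section and captures its content in group 1.
    private static final Pattern CDATA = Pattern.compile(
            "<!\\[CDATA\\[(.*?)]]>", Pattern.DOTALL);

    // Returns "key=value" tokens with every CDATA wrapper removed from the value.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // Replace each CDATA section with its captured content.
            String value = CDATA.matcher(m.group(2)).replaceAll("$1");
            tokens.add(m.group(1) + "=" + value);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "prop1=value1;prop2=<![CDATA[String test='1234';int i=4;]]>;prop3=value3";
        for (String token : tokenize(s)) {
            System.out.println(token);
        }
    }
}
```

    For the sample string this should print prop1=value1, prop2=String test='1234';int i=4; and prop3=value3, each on its own line. The sketch assumes the wrapped content never contains ]]> itself, since the lazy .*? stops at the first ]]> it finds.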