
How to split a string by multiple separators - and know which separator matched


With String.split it is easy to split a string by multiple separators. You just need to define a regular expression that matches all the separators you want to use. For example,

"1.22-3".split("[.-]")

results in an array with the elements "1", "22", and "3". So far so good.

Now, however, I also need to know which separator was found between each pair of segments. Is there a straightforward way to achieve this?

I looked at String.split, its legacy predecessor StringTokenizer, and other supposedly more modern libraries (e.g. StrTokenizer from Apache Commons), but none of them gives me access to the matched separator.


Solution

  • It’s quite simple if you retrace what String.split(regex) does and record the information that String.split discards:

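    import java.util.ArrayList;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String source = "1.22-3";
    Matcher m = Pattern.compile("[.-]").matcher(source);
    ArrayList<String> elements = new ArrayList<>();
    ArrayList<String> separators = new ArrayList<>();
    int pos;
    // Each match is one separator; the text between the previous match and this one is an element.
    for (pos = 0; m.find(); pos = m.end()) {
        elements.add(source.substring(pos, m.start()));
        separators.add(m.group());
    }
    // Whatever follows the last separator is the final element.
    elements.add(source.substring(pos));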
    

    At the end of this code, separators.get(x) yields the separator between elements.get(x) and elements.get(x+1). It follows that separators is always one item shorter than elements.

    If you want to have elements and separators in one list, just change the code to let these two lists be the same list. The items are already added in order of occurrence.
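The single-list variant described above could look roughly like the following sketch. The class name SplitWithSeparators and the method name splitKeepingSeparators are hypothetical, chosen only for illustration; the loop is the same Matcher-based loop, with both kinds of items added to one list. Unlike String.split, this sketch does not drop trailing empty strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitWithSeparators {
    // Hypothetical helper: returns elements and separators interleaved
    // in order of occurrence (element, separator, element, ...).
    static List<String> splitKeepingSeparators(String source, String separatorRegex) {
        Matcher m = Pattern.compile(separatorRegex).matcher(source);
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (m.find()) {
            parts.add(source.substring(pos, m.start())); // text before the separator
            parts.add(m.group());                        // the separator that matched
            pos = m.end();
        }
        parts.add(source.substring(pos));                // text after the last separator
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingSeparators("1.22-3", "[.-]"));
        // prints [1, ., 22, -, 3]
    }
}
```

With this layout, even indexes hold elements and odd indexes hold separators, so the list always has an odd number of items.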