java regex unique

Get unique regex matcher results (without using maps or lists)


Is there a way to get only the unique matches? I don't want to filter them through a list or a map after matching; I want the matcher output itself to be unique right away.

Sample input/output:

String input = "This is a question from [userName] about finding unique regex matches for [inputString] without using any lists or maps. -[userName].";
Pattern pattern = Pattern.compile("\\[[^\\[\\]]*\\]");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
    String tokenName = matcher.group(0);
    System.out.println(tokenName);
}

This will output the following:

[userName]
[inputString]
[userName]

But I want it to output the following:

[userName]
[inputString]

Solution

  • Yes, there is. You can combine a negative lookahead and a backreference:

    "(\\[[^\\[\\]]*\\])(?!.*\\1)"
    

    That only matches if the text matched by your actual pattern does not occur again later in the string. Effectively, you always get the last occurrence of every match, so the results come out in a different order:

    [inputString]
    [userName]
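
For reference, here is a minimal Java sketch of the lookahead-plus-backreference pattern, meant to be run against the sample input from the question. The class and method names (`LastOccurrenceDemo`, `lastOccurrences`) are made up for illustration, and the `StringBuilder` is only there so the result can be returned and inspected; it is not part of the technique.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative, not from the original answer.
class LastOccurrenceDemo {

    // A bracketed token that is NOT followed by another copy of itself on the same line.
    private static final Pattern LAST_OCCURRENCE =
            Pattern.compile("(\\[[^\\[\\]]*\\])(?!.*\\1)");

    // Returns the surviving matches, one per line, in the order the matcher finds them.
    static String lastOccurrences(String input) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = LAST_OCCURRENCE.matcher(input);
        while (matcher.find()) {
            if (out.length() > 0) {
                out.append('\n');
            }
            out.append(matcher.group(1));
        }
        return out.toString();
    }
}
```

Calling `LastOccurrenceDemo.lastOccurrences(input)` with the question's sample string should give `[inputString]` followed by `[userName]`, i.e. the last occurrence of each token.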
    

    If the order is a problem for you (i.e. if it's crucial to order them by first occurrence), you won't be able to do this using regex only. You would need a variable-length lookbehind for that, and that is not supported by Java.
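
If first-occurrence order is required in Java, one possible workaround (a swapped-in technique, not a pure regex solution) is to keep the question's plain pattern and accept a match only when `String.indexOf` reports that this is where its text first appears. That still needs no list or map. This is a sketch under the assumption that any earlier copy of the matched text would itself have been a match, which holds for the bracketed tokens here but not for every pattern. `FirstOccurrenceDemo` and `firstOccurrences` are hypothetical names, and the `StringBuilder` exists only to return the result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class FirstOccurrenceDemo {

    // The plain pattern from the question: a bracketed token with no nested brackets.
    private static final Pattern TOKEN = Pattern.compile("\\[[^\\[\\]]*\\]");

    // Returns each distinct token once, one per line, ordered by first occurrence.
    static String firstOccurrences(String input) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = TOKEN.matcher(input);
        while (matcher.find()) {
            String token = matcher.group();
            // Keep the match only if no copy of the same text starts earlier in the input.
            if (input.indexOf(token) == matcher.start()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(token);
            }
        }
        return out.toString();
    }
}
```

With the question's sample string this should give `[userName]` followed by `[inputString]`, which is the order the question asks for.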


    Some notes on a general solution

    Note that this will work with any pattern whose matches are of non-zero width. The general solution is simply:

    (yourPatternHere)(?!.*\1)
    

    (I left out the double backslash, because that only applies to a few languages.)

    If you want it to work with patterns that have zero-width matches (because you only want to know a position and, for some reason, are using only lookarounds), you could do this:

    (zeroWidthPatternHere)(?!.+\1)
    

    Also, note that (generally) you might have to use the "singleline" or "dotall" option if your input may contain line breaks (otherwise the lookahead will only check the current line). If you cannot or do not want to activate that (because your pattern includes periods that should not match line breaks, or because you use JavaScript), this is the general solution:

    (yourPatternHere)(?![\s\S]*\1)
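
To see the effect of line breaks in Java, the following sketch compiles the same idea three ways: without any flag, with `Pattern.DOTALL`, and with the `[\s\S]` form. `DotAllDemo` and `matches` are hypothetical names, and the bracketed-token pattern is borrowed from the question purely as an example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
class DotAllDemo {

    // Lookahead with a plain dot: only looks as far as the end of the current line.
    static final Pattern NO_FLAG =
            Pattern.compile("(\\[[^\\[\\]]*\\])(?!.*\\1)");

    // Same pattern, but the dot is allowed to cross line breaks.
    static final Pattern DOTALL =
            Pattern.compile("(\\[[^\\[\\]]*\\])(?!.*\\1)", Pattern.DOTALL);

    // No flag needed: [\s\S] matches any character, including line breaks.
    static final Pattern ANY_CHAR =
            Pattern.compile("(\\[[^\\[\\]]*\\])(?![\\s\\S]*\\1)");

    // Returns all matches of the given pattern, separated by single spaces.
    static String matches(Pattern pattern, String input) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(matcher.group(1));
        }
        return out.toString();
    }
}
```

For the input `"[a]\n[b]\n[a]"`, `matches(DotAllDemo.NO_FLAG, input)` should return `[a] [b] [a]` (the duplicate on another line goes unnoticed), while both `DOTALL` and `ANY_CHAR` should return `[b] [a]`.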
    

    And to make this answer even more widely applicable, here is how you could match only the first occurrence of every match (in an engine with variable-length lookbehinds, like .NET):

    (yourPatternHere)(?<!\1.*\1)
    or
    (yourPatternHere)(?<!\1[\s\S]*\1)