Tags: java, regex, regex-lookarounds, regex-group

Regex to tokenize log line


I have a log line as follows:

[2021-03-10 00:13:32.901] [DefaultDispatcher-worker-2 @coroutine#3] [DEBUG] [4231c006d9083a302fce59d5f0957226] [42c5ac3c0acfc68d] [GreeterImpl] Hello John

It consists of 6 blocks of text within [], followed by the rest of the line. I'm looking for a regex to extract the text within each [] as well as the text at the end. A text block within [] can be empty.

I tried (?:\[([^\[\]]*)\])+([^\[\]]+) but it only matches the first block in []. I've also tried (?:(?<=\[)[^\[\]]*(?=\]))+([^\[\]]+) but that doesn't match anything.

FWIW, the regex will be implemented in Java.


Solution

  • Short edit: This slightly simpler regular expression works too, except that the last token keeps its leading space (" Hello John"), so trim it if needed:

    (?:(?<=\[)[^\[\]]*)|(?:(?<=\])[^\[\]]*$)
    

    I have taken it from your own comment.

    Original answer follows.

    TL;DR

    (?:(?<=^\[| \[)[^\[\]]*)|(?:(?<=\] )[^\[\]]*$)
    

    Explanation: There are two parts separated by |, “or”.

    1. The first part, (?:(?<=^\[| \[)[^\[\]]*), matches what is inside square brackets. [^\[\]]* near the end matches the longest possible run of characters that are neither [ nor ]; because the run may be empty, an empty block [] is matched too. The lookbehind (?<=^\[| \[) requires the run to be preceded either by the beginning of the string and a [ or by a space and a [. Finally, I have put the whole thing into a non-capturing group to make it explicit that the lookbehind and the character run belong together on this side of the |.
    2. The second part, (?:(?<=\] )[^\[\]]*$), matches what is outside square brackets at the end of the log line (Hello John in the example). This time the run of non-brackets must be preceded by ] and a space, and followed by the end of the line.

    See it in action:

    1. On regex101 where I built it

    2. In Java:

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      String logLine = "[2021-03-10 00:13:32.901]"
              + " [DefaultDispatcher-worker-2 @coroutine#3] [DEBUG]"
              + " [4231c006d9083a302fce59d5f0957226] [42c5ac3c0acfc68d]"
              + " [GreeterImpl] Hello John";
      
      // Each find() yields the next token: bracket contents first, then the trailing text.
      Matcher m = Pattern
              .compile("(?:(?<=^\\[| \\[)[^\\[\\]]*)|(?:(?<=\\] )[^\\[\\]]*$)")
              .matcher(logLine);
      while (m.find()) {
          System.out.println(m.group());
      }
      

    Output is:

    2021-03-10 00:13:32.901
    DefaultDispatcher-worker-2 @coroutine#3
    DEBUG
    4231c006d9083a302fce59d5f0957226
    42c5ac3c0acfc68d
    GreeterImpl
    Hello John
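
The question mentions that a block within [] can be empty. Here is a minimal sketch of how the same pattern could be wrapped for reuse and checked against an empty block. The class name LogTokenizer, the method name tokenize and the shortened sample line are my own assumptions for illustration, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogTokenizer {
    // Same pattern as in the answer, compiled once so it can be reused.
    private static final Pattern TOKEN = Pattern.compile(
            "(?:(?<=^\\[| \\[)[^\\[\\]]*)|(?:(?<=\\] )[^\\[\\]]*$)");

    // Hypothetical helper: collects every match, including empty ones.
    static List<String> tokenize(String logLine) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(logLine);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample line with an empty second block.
        for (String token : tokenize("[2021-03-10] [] [DEBUG] Hello John")) {
            System.out.println("'" + token + "'");
        }
    }
}
```

The empty block comes back as an empty string rather than being skipped, so the position of each token in the list stays stable from line to line.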
    

    A different idea: String.split()

        String[] tokens = logLine.split("\\] \\[|\\] (?!\\[)");
        assert tokens[0].startsWith("[") : logLine;
        tokens[0] = tokens[0].substring(1);
    
        for (String token : tokens) {
            System.out.println(token);
        }
    

    Output is the same as before.

    I am splitting at either ] [ or ] not followed by [ (for the last split). It leaves the first [ intact, so I have to remove that separately, which is not so nice. Otherwise I find it simpler to understand than the other solutions.
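
One way to avoid patching tokens[0] afterwards is to drop the leading [ before splitting. The following is only a sketch of that variation. The class name SplitTokenizer and the method name tokenize are hypothetical, and I throw an exception instead of using assert because assertions are disabled by default in Java.

```java
import java.util.Arrays;
import java.util.List;

class SplitTokenizer {
    // Hypothetical helper wrapping the split idea described above.
    static List<String> tokenize(String logLine) {
        if (!logLine.startsWith("[")) {
            throw new IllegalArgumentException("Expected a leading '[': " + logLine);
        }
        // Drop the first '[' up front, then split at "] [" or at "] " not followed by '['.
        // The limit -1 keeps a trailing empty token if nothing follows the last block.
        String[] tokens = logLine.substring(1).split("\\] \\[|\\] (?!\\[)", -1);
        return Arrays.asList(tokens);
    }

    public static void main(String[] args) {
        // Shortened, made-up sample line with an empty second block.
        for (String token : tokenize("[2021-03-10] [] [DEBUG] Hello John")) {
            System.out.println("'" + token + "'");
        }
    }
}
```

Without the -1 limit, String.split() discards trailing empty strings, so a line that ends right after "] " would come back one token short.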