java regex regex-lookarounds lookbehind

Combined positive lookbehind and lookahead


I want to parse an array from a custom key-value protocol. It looks like this:

RESPONSE GAMEINFO OK
NAME: "gamelobby"
PLAYERS: "alice", "bob", "hodor"
FLAGS: 1, 2, 3

In Java the String looks like this (it uses CRLF as the line break):

RESPONSE GAMEINFO OK\\r\\nNAME: \"gamelobby\"\\r\\nPLAYERS: \"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3\\r\\n

I want to capture "alice", "bob", "hodor" as-is, so I used this regex, which I tested in Sublime Text and on regex101.com (keys are case-insensitive):

(?<=(?i:PLAYERS): )([A-Za-z0-9\s\.,:;\?!\n"_-]*)(?=\r\n)

This is a screenshot from Sublime Text (note: I left out \r here): [screenshot not included]

When I try to capture the group, I get the next line too:

Pattern p = Pattern.compile("(?<=(?i:"+key+"): )([A-Za-z0-9\\s\\.,:;\\?!\\n\"_-]*)(?=\\r\\n)");
Matcher matcher = p.matcher(message);
matcher.find();
String value = new String();
try {
    value = matcher.group(); // = "\"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3"
} ...

NOTE: \" or \\\" don't seem to make a difference.

Why is FLAGS: 1, 2, 3 captured up to the final \\r\\n, instead of the match stopping at the end of the PLAYERS line? Is it possible to combine a positive lookbehind with a positive lookahead? Which of the two is evaluated first?

EDIT: Definition of the string array is

values        = string *("," WSP string)
string        = DQUOTE *(ALPHA / DIGIT / WSP / punctuation / "\n") DQUOTE
punctuation   = "." / ":" / "," / ";" / "?" / "!" / "-" / "_"

Solution

  • Just write the code according to your grammar. The grammar doesn't seem ambiguous to me, so if you just follow it and compose your regex piece by piece, you are going to be alright:

    String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
    String PUNCTUATION_RE = "[.:,;?!_-]";
    String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
    String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
    String PLAYERS_RE = "PLAYERS:" +  WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
    

    Currently, \r\n is used to check for the line separator at the end of the PLAYERS entry. Change it to whatever your specification requires.
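
On why the pattern from the question runs into the next line: its character class contains \s (and \n), which also match the CR and LF of the line break, and * is greedy. The match therefore extends as far as the class allows and only backtracks to the last position that is followed by \r\n. Combining a lookbehind with a lookahead is fine in itself; the engine checks the lookbehind at the start position, then matches the group, then checks the lookahead. The following is a minimal sketch of that behaviour. The class name, the helper firstMatch and the single-line variant are illustrative assumptions, and the variant deliberately rejects strings with an embedded \n, which the grammar allows, so it is shown for contrast only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyOvershootDemo {
    // Pattern from the question, with the key fixed to PLAYERS for illustration.
    static final String QUESTION_RE =
        "(?<=(?i:PLAYERS): )([A-Za-z0-9\\s\\.,:;\\?!\\n\"_-]*)(?=\\r\\n)";

    // Same lookarounds, but the class uses a plain space instead of \s and
    // has no \n, so it cannot consume a CRLF (assumption: single-line values).
    static final String SINGLE_LINE_RE =
        "(?<=(?i:PLAYERS): )([A-Za-z0-9 .,:;?!\"_-]*)(?=\\r\\n)";

    // Hypothetical helper: returns the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "RESPONSE GAMEINFO OK\r\nNAME: \"gamelobby\"\r\n"
            + "PLAYERS: \"alice\", \"bob\", \"hodor\"\r\nFLAGS: 1, 2, 3\r\n";

        // Greedy class with \s: the match continues through the FLAGS line.
        System.out.println(firstMatch(QUESTION_RE, input).replace("\r\n", "<CRLF>"));

        // Class without CR/LF: the match ends at the PLAYERS line.
        System.out.println(firstMatch(SINGLE_LINE_RE, input));
    }
}
```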

    Caveat

    This solution only works for parsing valid input. Parsing invalid input depends on your recovery algorithm and the line separator.

    If the line separator may be \n as well as \r\n, it is hard to recover from an error. For example, suppose there is a user named ABC\nFLAGS: 1, 2, 3 (allowed by the grammar) and the closing double quote is missing. The list of players is then broken, and you cannot tell whether FLAGS: is part of the previous line or a separate header.

    RESPONSE GAMEINFO OK
    NAME: "gamelobby"
    PLAYERS: "alice", "bob", "hodor", "ABC
    FLAGS: 1, 2, 3
    FLAGS: 1, 2, 3
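
To make the "valid input only" point concrete, here is a minimal sketch, assuming \r\n is the only line separator. With a missing closing quote the pattern simply finds no match, so the caller has to decide what to do next. The class name PlayersLineParser and the method playersValue are hypothetical; the regex pieces are the ones composed above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlayersLineParser {
    // Same building blocks as in the answer, composed from the grammar.
    static final String WHITESPACE_RE = "[ ]";
    static final String PUNCTUATION_RE = "[.:,;?!_-]";
    static final String STRING_RE =
        "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
    static final String VALUES_RE =
        STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
    static final Pattern PLAYERS = Pattern.compile(
        "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)");

    // Returns the raw list of values, or empty if the PLAYERS entry
    // is missing or does not follow the grammar.
    static Optional<String> playersValue(String message) {
        Matcher m = PLAYERS.matcher(message);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String valid = "PLAYERS: \"alice\", \"bob\"\r\nFLAGS: 1, 2, 3\r\n";
        // The closing quote of the last name is missing.
        String broken = "PLAYERS: \"alice\", \"bob\", \"ABC\r\nFLAGS: 1, 2, 3\r\n";

        System.out.println(playersValue(valid).isPresent());  // a match is found
        System.out.println(playersValue(broken).isPresent()); // no match at all
    }
}
```

Returning an empty Optional only reports the failure; how to resynchronise with the rest of the message is the recovery question described in the caveat.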
    

    Full example

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class SO28290386 {
        public static void main(String[] args) {
            String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
            String PUNCTUATION_RE = "[.:,;?!_-]";
            String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
            String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
            String PLAYERS_RE = "PLAYERS:" +  WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
            System.out.println(PLAYERS_RE);
    
            String input = "RESPONSE GAMEINFO OK\r\nNAME: \"gamelobby\"\r\nPLAYERS: \"alice\", \"bob\", \"hodor\", \"new\nline\"\r\nFLAGS: 1, 2, 3\r\n";
            System.out.println("INPUT");
            System.out.println(input);
    
            Pattern p = Pattern.compile(PLAYERS_RE);
            Matcher m = p.matcher(input);
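            while (m.find()) {
                // group(0) is the whole match, including the "PLAYERS: " key
                System.out.println(m.group(0));
                // group(1) is only the captured list of values
                System.out.println(m.group(1));
            }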
        }
    }
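
Since the goal is to parse an array, the captured list can then be split into its individual strings. The sketch below is one possible way to do that, assuming the list has already been validated by the pattern above. It scans for quoted strings rather than splitting on commas, because a comma is allowed inside a string. The class name PlayerListSplitter and the method splitValues are illustrative, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlayerListSplitter {
    // One quoted string as defined by the grammar; group(1) is its content.
    // The trailing '-' in the class is a literal hyphen.
    static final Pattern STRING =
        Pattern.compile("\"([A-Za-z0-9 .:,;?!_\n-]*)\"");

    // Collects the content of every quoted string in the list, in order.
    static List<String> splitValues(String values) {
        List<String> result = new ArrayList<>();
        Matcher m = STRING.matcher(values);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String values = "\"alice\", \"bob\", \"hodor\", \"new\nline\"";
        for (String player : splitValues(values)) {
            // Make an embedded newline visible in the output.
            System.out.println(player.replace("\n", "\\n"));
        }
    }
}
```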