I want to parse an array from a custom key-value protocol. It looks like this:
RESPONSE GAMEINFO OK
NAME: "gamelobby"
PLAYERS: "alice", "bob", "hodor"
FLAGS: 1, 2, 3
In Java the String looks like this (it uses CRLF as the line break):
RESPONSE GAMEINFO OK\\r\\nNAME: \"gamelobby\"\\r\\nPLAYERS: \"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3\\r\\n
I want to capture "alice", "bob", "hodor" as-is. So I used this regexp, which I tested in Sublime Text and on regex101.com (keys are case-insensitive):
(?<=(?i:PLAYERS): )([A-Za-z0-9\s\.,:;\?!\n"_-]*)(?=\r\n)
[Screenshot from Sublime Text; note: I left out \r there.]
When I try to capture the group, I get the next line too:
Pattern p = Pattern.compile("(?<=(?i:"+key+"): )([A-Za-z0-9\\s\\.,:;\\?!\\n\"_-]*)(?=\\r\\n)");
Matcher matcher = p.matcher(message);
matcher.find();
String value = new String();
try {
    value = matcher.group(); // = "\"alice\", \"bob\", \"hodor\"\\r\\nFLAGS: 1, 2, 3"
} ...
NOTE: \" or \\\" don't seem to make a difference.
Why is FLAGS: 1, 2, 3 captured up to \\r\\n, instead of the match stopping at the end of the line above? Is it possible to combine positive lookbehind and lookahead? Which lookahead / lookbehind is evaluated first?
EDIT: Definition of the string array is
values = string *("," WSP string)
string = DQUOTE *(ALPHA / DIGIT / WSP / punctuation / "\n") DQUOTE
punctuation = "." / ":" / "," / ";" / "?" / "!" / "-" / "_"
Just write the code according to your grammar. The grammar doesn't seem ambiguous to me, so if you just follow it and compose your regex piece by piece, you are going to be alright:
String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
String PUNCTUATION_RE = "[.:,;?!_-]";
String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
String PLAYERS_RE = "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
Currently, \r\n is used to check for the line separator at the end of the PLAYERS entry. Change it to whatever your specification requires.
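As an aside on why the pattern in the question overruns: its character class contains \s (and \n), and \s also matches \r, so the greedy * runs across line breaks and the lookahead is only satisfied by backtracking to the last \r\n in the message. The engine works left to right: it checks the lookbehind at a candidate start, consumes the class greedily, and then tests the lookahead. Here is a minimal sketch that reproduces this, assuming a shortened message; the class name GreedyOverrunDemo and the helper firstMatch are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyOverrunDemo {
    // The pattern from the question, with the key fixed to PLAYERS.
    // \s inside the class matches \r and \n, so the class can cross line breaks.
    static final String QUESTION_RE =
        "(?<=(?i:PLAYERS): )([A-Za-z0-9\\s\\.,:;\\?!\\n\"_-]*)(?=\\r\\n)";

    // Hypothetical helper: returns the first match of regex in input, or null.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Shortened message (an assumption for this sketch), real CRLF line breaks.
        String input = "NAME: \"gamelobby\"\r\nPLAYERS: \"alice\", \"bob\"\r\nFLAGS: 1, 2, 3\r\n";
        String value = firstMatch(QUESTION_RE, input);
        // The match runs through the FLAGS line and stops before the final \r\n.
        System.out.println(value.equals("\"alice\", \"bob\"\r\nFLAGS: 1, 2, 3")); // prints true
    }
}
```

Every character of the FLAGS line (letters, colon, space, digits, comma) is also in the class, which is why nothing stops the match earlier.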
This solution only works for parsing valid input. Parsing invalid input depends on your recovery algorithm and the line separator.
If the line separator allows for \n as well as \r\n, it is hard to recover from an error. For example, if there is a user named ABC\nFLAGS: 1, 2, 3 (allowed according to the grammar), but the closing double quote is missing, the list of players will be broken, and you won't be able to tell whether FLAGS: is part of the previous line or a different header.
RESPONSE GAMEINFO OK
NAME: "gamelobby"
PLAYERS: "alice", "bob", "hodor", "ABC
FLAGS: 1, 2, 3
FLAGS: 1, 2, 3
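To make the failure mode concrete, here is a small sketch of what the grammar-based pattern does with such input. It assumes, for illustration only, that a line may end in either \r\n or \n; the class name BrokenInputDemo and the helper players are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BrokenInputDemo {
    static final String WHITESPACE_RE = "[ ]";
    static final String PUNCTUATION_RE = "[.:,;?!_-]";
    static final String STRING_RE =
        "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
    static final String VALUES_RE =
        STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
    // Assumption for this sketch: a line may end in \r\n or in \n alone.
    static final Pattern PLAYERS = Pattern.compile(
        "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r?\n)");

    // Hypothetical helper: returns the value list, or null if the entry is invalid.
    static String players(String message) {
        Matcher m = PLAYERS.matcher(message);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String valid = "PLAYERS: \"alice\", \"bob\"\nFLAGS: 1, 2, 3\n";
        // Closing quote after ABC is missing, as in the example above.
        String broken = "PLAYERS: \"alice\", \"bob\", \"hodor\", \"ABC\nFLAGS: 1, 2, 3\nFLAGS: 1, 2, 3\n";
        System.out.println(players(valid));  // the value list of the valid entry
        System.out.println(players(broken)); // null: no match at all
    }
}
```

The pattern simply fails to match the broken entry; it does not say where the list should have ended, so that decision is left to whatever recovery strategy you choose.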
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SO28290386 {
    public static void main(String[] args) {
        // Build the regex piece by piece, following the grammar
        String WHITESPACE_RE = "[ ]"; // Modify this according to your grammar
        String PUNCTUATION_RE = "[.:,;?!_-]";
        String STRING_RE = "\"(?:[A-Za-z0-9" + WHITESPACE_RE + PUNCTUATION_RE + "\n])*\"";
        String VALUES_RE = STRING_RE + "(?:," + WHITESPACE_RE + STRING_RE + ")*";
        String PLAYERS_RE = "PLAYERS:" + WHITESPACE_RE + "(" + VALUES_RE + ")(?=\r\n)";
        System.out.println(PLAYERS_RE);

        // The last player name contains a bare \n, which the grammar allows
        String input = "RESPONSE GAMEINFO OK\r\nNAME: \"gamelobby\"\r\nPLAYERS: \"alice\", \"bob\", \"hodor\", \"new\nline\"\r\nFLAGS: 1, 2, 3\r\n";
        System.out.println("INPUT");
        System.out.println(input);

        Pattern p = Pattern.compile(PLAYERS_RE);
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println(m.group(0)); // whole match, including the PLAYERS: key
            System.out.println(m.group(1)); // only the list of values
        }
    }
}
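Once group 1 holds the value list, the individual strings can be pulled out by scanning it with a pattern for a single string. Here is a rough sketch; the class SplitValuesDemo and the method splitValues are illustrative names, and the character set is copied from the grammar above, so adjust it to your specification:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitValuesDemo {
    // One quoted string; the capturing group holds the content without the quotes.
    // Allowed characters follow the grammar: ALPHA, DIGIT, space, punctuation, \n.
    static final Pattern ITEM = Pattern.compile("\"((?:[A-Za-z0-9 .:,;?!_\n-])*)\"");

    // Hypothetical helper: splits an already validated value list into its strings.
    static List<String> splitValues(String values) {
        List<String> out = new ArrayList<>();
        Matcher m = ITEM.matcher(values);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String values = "\"alice\", \"bob\", \"hodor\", \"new\nline\"";
        System.out.println(splitValues(values).size()); // prints 4
    }
}
```

Scanning for quoted strings is used here instead of splitting on ", " because the grammar allows a comma followed by a space inside a string.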