When tokenising a JSON string, my tokeniser returns an incorrect value: it concatenates several values into a single token (i.e. "username": "Azoraqua", "age":).
That text should come back as three separate tokens: IDENTIFIER (twice) and STRING_LITERAL (once). Note that it does return the age
number as its own token (INTEGER_LITERAL).
I've tried several ways to achieve the correct behaviour:
- Changing some of the regular expressions related to IDENTIFIER and STRING_LITERAL.
- Changing some of the actual tokenising logic.
private static final Set<TokenData> tokenDatas = new LinkedHashSet<>();
static {
tokenDatas.add(new TokenData(Pattern.compile("^([,:])"), TokenType.TOKEN));
tokenDatas.add(new TokenData(Pattern.compile("^(\\{)"), TokenType.BEGIN_OBJECT));
tokenDatas.add(new TokenData(Pattern.compile("^(})"), TokenType.END_OBJECT));
tokenDatas.add(new TokenData(Pattern.compile("^(\\[)"), TokenType.BEGIN_ARRAY));
tokenDatas.add(new TokenData(Pattern.compile("^(])"), TokenType.END_ARRAY));
tokenDatas.add(new TokenData(Pattern.compile("^(\".*\":)"), TokenType.IDENTIFIER));
tokenDatas.add(new TokenData(Pattern.compile("^(\".*\")"), TokenType.STRING_LITERAL, (s) -> s.substring(1, s.length() - 1)));
tokenDatas.add(new TokenData(Pattern.compile("^((-)?[0-9]+)"), TokenType.INTEGER_LITERAL));
tokenDatas.add(new TokenData(Pattern.compile("^((-)?[0-9]*(\\.)[0-9]+)"), TokenType.DOUBLE_LITERAL));
tokenDatas.add(new TokenData(Pattern.compile("^(true|false)", Pattern.CASE_INSENSITIVE), TokenType.BOOLEAN_LITERAL));
}
@Override
public Token next() {
str = str.trim();
if (pushback) {
pushback = false;
return lastToken;
}
if (str.isEmpty()) {
return (lastToken = new Token(TokenType.EMPTY, ""));
}
for (TokenData data: tokenDatas) {
Matcher matcher = data.pattern.matcher(str);
if (matcher.find()) {
String token = matcher.group().trim();
str = matcher.replaceFirst("");
if (data.action != null) {
token = data.action.apply(token);
}
return (lastToken = new Token(data.type, token));
}
}
throw new IllegalStateException("Could not parse " + str);
}
When the input is {"username": "Azoraqua", "age": 21}
then the output should be:
1. BEGIN_OBJECT ( { )
2. IDENTIFIER ( "username": )
3. STRING_LITERAL ( "Azoraqua" )
4. TOKEN ( , )
5. IDENTIFIER ( "age" )
6. INTEGER_LITERAL ( 21 )
7. END_OBJECT ( } )
How do I solve the problem?
The problem is most likely in this line:
tokenDatas.add(new TokenData(Pattern.compile("^(\".*\":)"), TokenType.IDENTIFIER));
Regular-expression quantifiers such as * are greedy by default. This means that they will try to match as much as they possibly can.
So, for a string such as this:
"username": "Azoraqua", "age": 21 }
The .*\": part of the regular expression will match everything after the opening quote of "username", up to and including the last possible \": in the input, which appears just before the space in front of 21.
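To see the greedy behaviour in isolation, here is a minimal sketch that runs the original IDENTIFIER pattern against the sample text. The class name GreedyDemo and the helper firstMatch are made up for this illustration and are not part of the question's tokeniser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original tokeniser.
class GreedyDemo {
    // Returns the text matched by the regex, or null if nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String rest = "\"username\": \"Azoraqua\", \"age\": 21 }";
        // Greedy: .* first consumes the rest of the input, then backtracks
        // only as far as the LAST occurrence of ": it can find.
        System.out.println(firstMatch("^(\".*\":)", rest));
        // prints: "username": "Azoraqua", "age":
    }
}
```

The single match swallows both keys and the string value, which is exactly the concatenated token described in the question.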
Try making your regex non-greedy with a "?" modifier.
tokenDatas.add(new TokenData(Pattern.compile("^(\".*?\":)"), TokenType.IDENTIFIER));
You might want to allow for optional whitespace as well. Note that the backslash in \s has to be doubled inside a Java string literal:
tokenDatas.add(new TokenData(Pattern.compile("^(\".*?\"\\s*:)"), TokenType.IDENTIFIER));
You will almost certainly have a similar problem with TokenType.STRING_LITERAL. It is also greedy. You can fix it with the same solution, i.e. making the .* non-greedy.
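One caveat worth checking: because IDENTIFIER is tried before STRING_LITERAL, a non-greedy .*? can still run past the closing quote of a string value until it finds a following ":. A possible alternative, sketched below, is to replace .*? with the negated character class [^\"]*, which cannot cross a quote at all. The class name QuoteBoundedDemo and its helper are hypothetical, and this sketch assumes the strings contain no escaped quotes (\"), which real JSON allows.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; assumes strings never contain escaped quotes.
class QuoteBoundedDemo {
    // Non-greedy version suggested in the answer.
    static final Pattern LAZY_IDENTIFIER = Pattern.compile("^(\".*?\"\\s*:)");
    // Alternative: [^"]* stops at the first closing quote and cannot skip it.
    static final Pattern IDENTIFIER = Pattern.compile("^(\"[^\"]*\"\\s*:)");
    static final Pattern STRING_LITERAL = Pattern.compile("^(\"[^\"]*\")");

    // Returns the matched text, or null if the pattern does not match.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String key = "\"username\": \"Azoraqua\", \"age\": 21}";
        String value = "\"Azoraqua\", \"age\": 21}";

        System.out.println(firstMatch(LAZY_IDENTIFIER, key));   // "username":
        // The lazy quantifier still expands until a ": follows:
        System.out.println(firstMatch(LAZY_IDENTIFIER, value)); // "Azoraqua", "age":
        // The negated class refuses to treat the value as an identifier...
        System.out.println(firstMatch(IDENTIFIER, value));      // null
        // ...so the STRING_LITERAL pattern gets its turn.
        System.out.println(firstMatch(STRING_LITERAL, value));  // "Azoraqua"
    }
}
```

If escaped quotes need to be supported, the character class would have to be extended to accept backslash escape sequences; that is left out here to keep the sketch short.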