Tags: java, json, regex, tokenize

Why does this tokenizer return incorrect values?


When tokenising a JSON string, the tokenizer returns an incorrect value: it appears to concatenate several values into a single token (i.e. "username": "Azoraqua", "age":), where it should return IDENTIFIER (twice) and STRING_LITERAL (once). Note that it does return the age number as its own token (INTEGER_LITERAL).

I've tried several ways to achieve the correct behaviour:
- Changing some of the regular expressions related to IDENTIFIER and STRING_LITERAL.
- Changing some of the actual tokenising logic.

private static final Set<TokenData> tokenDatas = new LinkedHashSet<>();

static {
    tokenDatas.add(new TokenData(Pattern.compile("^(,:)"), TokenType.TOKEN));
    tokenDatas.add(new TokenData(Pattern.compile("^(\\{)"), TokenType.BEGIN_OBJECT));
    tokenDatas.add(new TokenData(Pattern.compile("^(})"), TokenType.END_OBJECT));
    tokenDatas.add(new TokenData(Pattern.compile("^(\\[)"), TokenType.BEGIN_ARRAY));
    tokenDatas.add(new TokenData(Pattern.compile("^(])"), TokenType.END_ARRAY));
    tokenDatas.add(new TokenData(Pattern.compile("^(\".*\":)"), TokenType.IDENTIFIER));
    tokenDatas.add(new TokenData(Pattern.compile("^(\".*\")"), TokenType.STRING_LITERAL, (s) -> s.substring(1, s.length() - 1)));
    tokenDatas.add(new TokenData(Pattern.compile("^((-)?[0-9]+)"), TokenType.INTEGER_LITERAL));
    tokenDatas.add(new TokenData(Pattern.compile("^((-)?[0-9]*(\\.)[0-9]+)"), TokenType.DOUBLE_LITERAL));
    tokenDatas.add(new TokenData(Pattern.compile("^(true|false)", Pattern.CASE_INSENSITIVE), TokenType.BOOLEAN_LITERAL));
}

@Override
public Token next() {
    str = str.trim();

    if (pushback) {
        pushback = false;
        return lastToken;
    }

    if (str.isEmpty()) {
        return (lastToken = new Token(TokenType.EMPTY, ""));
    }

    for (TokenData data: tokenDatas) {
        Matcher matcher = data.pattern.matcher(str);

        if (matcher.find()) {
            String token = matcher.group().trim();
            str = matcher.replaceFirst("");

            if (data.action != null) {
                token = data.action.apply(token);
            }

            return (lastToken = new Token(data.type, token));
        }
    }

    throw new IllegalStateException("Could not parse " + str);
}

When the input is {"username": "Azoraqua", "age": 21} then the output should be:
1. BEGIN_OBJECT ( { )
2. IDENTIFIER ( "username": )
3. STRING_LITERAL ( "Azoraqua" )
4. TOKEN ( , )
5. IDENTIFIER ( "age": )
6. INTEGER_LITERAL ( 21 )
7. END_OBJECT ( } )

How do I solve the problem?


Solution

  • The problem is most likely in this line:

        tokenDatas.add(new TokenData(Pattern.compile("^(\".*\":)"), TokenType.IDENTIFIER));
    

    Regular expression quantifiers such as * are greedy by default. This means that they will try to match as much as they possibly can.

    So, for a string such as this:

    "username": "Azoraqua", "age": 21 }

    The .*\": part of the regular expression will match from the u in "username" through every character up to and including the last possible \":, which here is the one just before the space in front of 21.

    Try making the quantifier non-greedy (reluctant) by adding a "?" after it:

        tokenDatas.add(new TokenData(Pattern.compile("^(\".*?\":)"), TokenType.IDENTIFIER));
    

    You might want to allow for optional whitespace as well. Note that the backslash in \s must be doubled inside a Java string literal:

        tokenDatas.add(new TokenData(Pattern.compile("^(\".*?\"\\s*:)"), TokenType.IDENTIFIER));
    

    You will almost certainly have a similar problem with TokenType.STRING_LITERAL. It is also greedy. You can fix it with the same solution, i.e. making the .* non-greedy.
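
One caveat worth checking: a reluctant .*? still grows until the rest of the pattern can match, and because the IDENTIFIER pattern is tried before STRING_LITERAL, it can still swallow a string value together with the key that follows it. A common alternative is the negated character class [^\"]*, which can never cross a quote. The following is a minimal sketch of that idea, not the asker's actual class: TokenizerSketch, PATTERNS, firstMatch and tokenize are hypothetical names, only the token types needed for the sample input are included, and escaped quotes inside strings are not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenizerSketch {
    // Hypothetical stand-in for the question's tokenDatas: type name -> pattern,
    // tried in insertion order. Only the types needed for the sample are listed.
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();

    static {
        PATTERNS.put("TOKEN", Pattern.compile("^,"));
        PATTERNS.put("BEGIN_OBJECT", Pattern.compile("^\\{"));
        PATTERNS.put("END_OBJECT", Pattern.compile("^\\}"));
        // [^\"]* cannot cross a quote, so a key ends at its own closing quote.
        PATTERNS.put("IDENTIFIER", Pattern.compile("^\"[^\"]*\"\\s*:"));
        PATTERNS.put("STRING_LITERAL", Pattern.compile("^\"[^\"]*\""));
        PATTERNS.put("INTEGER_LITERAL", Pattern.compile("^-?[0-9]+"));
    }

    // Returns the text the regex matches at the start of the input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Splits the input into "TYPE text" strings, trying each pattern in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        String rest = input.trim();
        while (!rest.isEmpty()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
                Matcher m = e.getValue().matcher(rest);
                if (m.find()) {
                    tokens.add(e.getKey() + " " + m.group());
                    rest = rest.substring(m.end()).trim();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                throw new IllegalStateException("Could not parse " + rest);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // What remains of the sample input once the first key has been consumed.
        String value = "\"Azoraqua\", \"age\": 21}";
        // Greedy: prints "Azoraqua", "age":
        System.out.println(firstMatch("^(\".*\":)", value));
        // Reluctant: also prints "Azoraqua", "age": because .*? keeps growing
        // until a quote followed by a colon is found.
        System.out.println(firstMatch("^(\".*?\"\\s*:)", value));
        // Negated class: prints null, so STRING_LITERAL gets its turn.
        System.out.println(firstMatch("^(\"[^\"]*\"\\s*:)", value));

        tokenize("{\"username\": \"Azoraqua\", \"age\": 21}").forEach(System.out::println);
    }
}
```

With the sample input from the question, tokenize yields the seven tokens listed there: BEGIN_OBJECT, IDENTIFIER, STRING_LITERAL, TOKEN, IDENTIFIER, INTEGER_LITERAL and END_OBJECT. Reordering the patterns so that STRING_LITERAL comes first would not help with the reluctant version, since a key would then be reported as a string; bounding the match at the closing quote avoids depending on the order.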