
StreamTokenizer mangles integers and loose periods


I've adapted the code below, which does a pretty good job of tokenizing Java code using Java's StreamTokenizer. Its number handling is problematic, though:

  1. It turns all integers into doubles. I can get past that by testing num % 1 == 0, but this feels like a hack.
  2. More critically, a . following whitespace is treated as a number. "Class .method()" is legal Java syntax, but the resulting tokens are [Word "Class"], [Whitespace " "], [Number 0.0], [Word "method"], [Symbol "("], and [Symbol ")"].
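
For reference, both symptoms can be reproduced with a StreamTokenizer left at its default settings. This is only a minimal sketch: the class LoosePeriodDemo and its describe helper are made-up names for illustration and are not part of the code in this question.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class LoosePeriodDemo {

    // Hypothetical helper: lists the tokens a default StreamTokenizer produces.
    static String describe(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));
        StringBuilder sb = new StringBuilder();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                switch (st.ttype) {
                case StreamTokenizer.TT_NUMBER:
                    // Every number arrives as a double in nval
                    sb.append("[Number ").append(st.nval).append(']');
                    break;
                case StreamTokenizer.TT_WORD:
                    sb.append("[Word ").append(st.sval).append(']');
                    break;
                default:
                    // Ordinary character: ttype is the character itself
                    sb.append("[Symbol ").append((char) st.ttype).append(']');
                    break;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The lone period after the space comes back as a number
        System.out.println(describe("Class .method()"));
        // The integer comes back as a double
        System.out.println(describe("7"));
    }
}
```

Note that whitespace is skipped in this sketch because the default syntax table treats it as a separator.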

I'd be happy to turn off StreamTokenizer's number parsing entirely and parse the numbers myself from word tokens, but commenting out st.parseNumbers() seems to have no effect.

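import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Token and its subclasses (IntegerToken, WordToken, ...) are not shown here
public class JavaTokenizer {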

private String code;

private List<Token> tokens;

public JavaTokenizer(String c) {
    code = c;
    tokens = new ArrayList<>();
}

public void tokenize() {
    try {
        // Create the tokenizer
        StringReader sr = new StringReader(code);
        StreamTokenizer st = new StreamTokenizer(sr);

        // Java-style tokenizing rules
        st.parseNumbers();
        st.wordChars('_', '_');
        st.eolIsSignificant(false);

        // Don't want whitespace tokens
        //st.ordinaryChars(0, ' ');

        // Strip out comments
        st.slashSlashComments(true);
        st.slashStarComments(true);

        // Parse the file
        int token;
        do {
            token = st.nextToken();
            switch (token) {
            case StreamTokenizer.TT_NUMBER:
                // A number was found; the value is in nval
                double num = st.nval;
                if(num % 1 == 0)
                  tokens.add(new IntegerToken((int) num));
                else
                  tokens.add(new FPNumberToken(num));
                break;
            case StreamTokenizer.TT_WORD:
                // A word was found; the value is in sval
                String word = st.sval;
                tokens.add(new WordToken(word));
                break;
            case '"':
                // A double-quoted string was found; sval contains the contents
                String dquoteVal = st.sval;
                tokens.add(new DoubleQuotedStringToken(dquoteVal));
                break;
            case '\'':
                // A single-quoted string was found; sval contains the contents
                String squoteVal = st.sval;
                tokens.add(new SingleQuotedStringToken(squoteVal));
                break;
            case StreamTokenizer.TT_EOL:
                // End of line character found
                tokens.add(new EOLToken());
                break;
            case StreamTokenizer.TT_EOF:
                // End of file has been reached
                tokens.add(new EOFToken());
                break;
            default:
                // A regular character was found; the value is the token itself
                char ch = (char) st.ttype;
                if(Character.isWhitespace(ch))
                    tokens.add(new WhitespaceToken(ch));
                else
                    tokens.add(new SymbolToken(ch));
                break;
            }
        } while (token != StreamTokenizer.TT_EOF);
        sr.close();
    } catch (IOException e) {
        // Ignored: the input is an in-memory StringReader
    }
}

public List<Token> getTokens() {
    return tokens;
}

}

Solution

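  • parseNumbers() is "on" by default: the StreamTokenizer constructor already enables number parsing, which is why commenting out the call has no effect. Use resetSyntax() to turn off number parsing and all other predefined character types, then enable what you need.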

    That said, manual number parsing can get tricky once you have to account for dots and exponents. With a scanner and regular expressions it should be relatively straightforward to implement your own tokenizer, tailored exactly to your needs. For an example, you may want to take a look at the Tokenizer inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (about 120 LOC at the end)
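
The resetSyntax() route from the first point could look roughly like this. It is a sketch under assumptions, not a drop-in replacement: the class NoNumberTokenizerSketch and its tokenize helper are hypothetical names, and tokens are returned as plain strings rather than the Token classes from the question. Digits are registered as word characters, so numbers arrive as word tokens and a period is always an ordinary symbol.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class NoNumberTokenizerSketch {

    // Hypothetical helper: tokenizes with number parsing switched off.
    static List<String> tokenize(String code) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(code));

        // Clear every predefined character type, including number parsing
        st.resetSyntax();

        // Re-enable only what is needed
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('_', '_');
        st.wordChars('0', '9');      // digits become part of word tokens
        st.whitespaceChars(0, ' ');
        st.quoteChar('"');
        st.quoteChar('\'');
        st.slashSlashComments(true);
        st.slashStarComments(true);

        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("Word " + st.sval);
                } else if (st.ttype == '"' || st.ttype == '\'') {
                    out.add("String " + st.sval);
                } else {
                    // '.' is an ordinary character here, never a number
                    out.add("Symbol " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Word Class, Symbol ., Word method, Symbol (, Word 42, Symbol )]
        System.out.println(tokenize("Class .method(42)"));
    }
}
```

With this setup a literal such as 3.14 comes back as three tokens (a word, a period, a word), so reassembling floating-point literals, exponents and suffixes is left to the caller. That is the tricky part mentioned above.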