I've appropriated and modified the below code which does a pretty good job of tokenizing Java code using Java's StreamTokenizer. Its number handling is problematic, though:
I'd be happy turning off StreamTokenizer's number parsing entirely and parsing the numbers myself from word tokens, but commenting st.parseNumbers() seems to have no effect.
public class JavaTokenizer {
private String code;
private List<Token> tokens;
public JavaTokenizer(String c) {
code = c;
tokens = new ArrayList<>();
}
public void tokenize() {
try {
// Create the tokenizer
StringReader sr = new StringReader(code);
StreamTokenizer st = new StreamTokenizer(sr);
// Java-style tokenizing rules
st.parseNumbers();
st.wordChars('_', '_');
st.eolIsSignificant(false);
// Don't want whitespace tokens
//st.ordinaryChars(0, ' ');
// Strip out comments
st.slashSlashComments(true);
st.slashStarComments(true);
// Parse the file
int token;
do {
token = st.nextToken();
switch (token) {
case StreamTokenizer.TT_NUMBER:
// A number was found; the value is in nval
double num = st.nval;
if(num % 1 == 0)
tokens.add(new IntegerToken((int)num);
else
tokens.add(new FPNumberToken(num));
break;
case StreamTokenizer.TT_WORD:
// A word was found; the value is in sval
String word = st.sval;
tokens.add(new WordToken(word));
break;
case '"':
// A double-quoted string was found; sval contains the contents
String dquoteVal = st.sval;
tokens.add(new DoubleQuotedStringToken(dquoteVal));
break;
case '\'':
// A single-quoted string was found; sval contains the contents
String squoteVal = st.sval;
tokens.add(new SingleQuotedStringToken(squoteVal));
break;
case StreamTokenizer.TT_EOL:
// End of line character found
tokens.add(new EOLToken());
break;
case StreamTokenizer.TT_EOF:
// End of file has been reached
tokens. add(new EOFToken());
break;
default:
// A regular character was found; the value is the token itself
char ch = (char) st.ttype;
if(Character.isWhitespace(ch))
tokens.add(new WhitespaceToken(ch));
else
tokens.add(new SymbolToken(ch));
break;
}
} while (token != StreamTokenizer.TT_EOF);
sr.close();
} catch (IOException e) {
}
}
public List<Token> getTokens() {
return tokens;
}
}
parseNumbers() in "on" by default. Use resetSyntax() to turn off number parsing and all other predefined character types, then enable what you need.
That said, manual number parsing might get tricky with accounting for dots and exponents... With a scanner and regular expressions it should be relatively straightforward to implement your own tokenizer, tailored exactly to your needs. For an example, you may want to take a look at the Tokenizer
inner class here: https://github.com/stefanhaustein/expressionparser/blob/master/core/src/main/java/org/kobjects/expressionparser/ExpressionParser.java (about 120 LOC at the end)