I am building a compiler. Some of the specifications of this are the following:
Now I have to split a source code line to tokenize it. Example case:
PRINT $ THE FLOAT IS $ * DISPLAY THE RESULT *
When I tokenize it, it should produce:
PRINT - token is KEYWORD
THE FLOAT IS - token is STRING_LITERAL
DISPLAY THE RESULT - token is COMMENT
I would like to know the most efficient way to obtain this. Note that I still have to validate each occurrence of a string literal and a comment (e.g. check that it is properly enclosed). So far my approach is to split each line on whitespace, and when a lexeme contains a "$" or "*", I validate the string literal. Here is my implementation:
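// Needs: import java.util.regex.Pattern;
private void getLexemes() {
    for (String line : newSourceCode) {
        String[] lexemesInALine = line.trim().split("\\s+");
        for (String lexemeInALine : lexemesInALine) {
            if (!lexemeInALine.contains("$")) {
                lexemes.add(lexemeInALine);
                tempTokens.add(findToken(lexemeInALine));
                // Pattern.quote stops characters such as "*" from being read as regex syntax
                line = line.replaceFirst(Pattern.quote(lexemeInALine), "").trim();
            } else {
                // a string literal starts here; hand the rest of the line to the validator
                validateStringType(line);
                break;
            }
        }
    }
}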
Thank you for the help.
I assume your language is deterministic and context-free? If so, you can't correctly parse it using regular expressions alone.
What you need is a state machine that works on a stream of tokens.
Java comes with two classes that might work for you: StreamTokenizer and StringTokenizer.
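To see what StringTokenizer gives you on the example line from the question, here is a small sketch. The class name TokenizerDemo and the method split are made up for illustration; the only library behaviour it relies on is that a StringTokenizer built with the one-argument constructor splits on whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; the name is not from the question or the answer.
class TokenizerDemo {
    // Splits a line on whitespace, which is StringTokenizer's default delimiter set.
    static List<String> split(String line) {
        List<String> parts = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            parts.add(tokenizer.nextToken());
        }
        return parts;
    }

    public static void main(String[] args) {
        // prints [PRINT, $, THE, FLOAT, IS, $, *, DISPLAY, THE, RESULT, *]
        System.out.println(split("PRINT $ THE FLOAT IS $ * DISPLAY THE RESULT *"));
    }
}
```

Note that "$" and "*" only come out as separate tokens because the example surrounds them with spaces, and that the original spacing inside a string literal is lost. Both are reasons you still need some state on top of the raw token stream.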
But what you really want is to use one of the dozens of parser generators. Maybe something like ANTLR. There are plenty described here:
https://en.wikipedia.org/wiki/Comparison_of_parser_generators
If all this fails, a finite state machine it is. Something along these lines:
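import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class Parsy {
    enum State { string, comment, token }

    void parse(StringTokenizer tokenizer) {
        State state = State.token;
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // figure out type of token
            if (token.length() == 1) {
                char delim = token.charAt(0);
                switch (delim) {
                    case '$':
                        switch (state) {
                            case token: {
                                // a string literal has started, emit what we have, start a string
                                printOut(tokens, state);
                                tokens.clear();
                                tokens.add(token);
                                state = State.string;
                                break;
                            }
                            case string: { // parsing a string, so this ends it
                                printOut(tokens, state);
                                tokens.clear();
                                state = State.token;
                                break;
                            }
                            case comment: { // $ is ignored since we are in a comment
                                tokens.add(token);
                                break;
                            }
                        }
                        break;
                    // ...
                    default:
                        // any other one-character token is ordinary content, so keep it
                        tokens.add(token);
                        break;
                }
            } else {
                // not a delimiter token
                tokens.add(token);
            }
        } // end of while
        if (state != State.token) {
            System.out.println("Oops! Syntax error. I'm still parsing " + state);
        } else {
            // emit whatever ordinary tokens are left at the end of the input
            printOut(tokens, state);
        }
    }

    // Placeholder (assumption): printOut was not shown; this version just prints the tokens and their state.
    private void printOut(List<String> tokens, State state) {
        if (!tokens.isEmpty()) {
            System.out.println(state + ": " + String.join(" ", tokens));
        }
    }
}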
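If you would rather keep the text between the delimiters intact and validate the enclosure in the same pass, you can scan the line character by character instead of going through StringTokenizer. The following is only a rough sketch under a few assumptions: LineLexer, tokenize and the "TYPE:text" output format are invented for illustration, "$" and "*" are treated purely as delimiters (so "*" can never be an operator), the text between delimiters is trimmed, and every bare word is reported as WORD, leaving the keyword lookup to something like your findToken.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; class, method and token names are illustrative assumptions.
class LineLexer {
    // Scans one line and returns entries of the form "TYPE:text".
    // Throws IllegalArgumentException when a string literal or comment is not closed.
    static List<String> tokenize(String line) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip whitespace between tokens
            } else if (c == '$' || c == '*') {
                // find the matching closing delimiter on the same line
                int close = line.indexOf(c, i + 1);
                if (close < 0) {
                    throw new IllegalArgumentException("Unclosed " + c + " at column " + i);
                }
                String type = (c == '$') ? "STRING_LITERAL" : "COMMENT";
                out.add(type + ":" + line.substring(i + 1, close).trim());
                i = close + 1;
            } else {
                // read a bare word up to the next whitespace or delimiter
                int start = i;
                while (i < line.length()
                        && !Character.isWhitespace(line.charAt(i))
                        && line.charAt(i) != '$'
                        && line.charAt(i) != '*') {
                    i++;
                }
                out.add("WORD:" + line.substring(start, i));
            }
        }
        return out;
    }
}
```

For the example line, LineLexer.tokenize("PRINT $ THE FLOAT IS $ * DISPLAY THE RESULT *") returns WORD:PRINT, STRING_LITERAL:THE FLOAT IS and COMMENT:DISPLAY THE RESULT, while a line such as "PRINT $ OOPS" is rejected because the string literal is never closed.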