java regex string split compiler-construction

Split string using custom regex java

I am building a compiler. Some of the specifications of this are the following:

String literals are enclosed by dollar sign ("$") - eg. $ string sample $
Comments are enclosed by "*" - eg. * sample comment *
Comments could exist anywhere execept between operations - eg. 4 + * sample comment * 5 - (this is not allowed)

Now I have to split a source code line to tokenize it. Example case:

PRINT $ THE FLOAT IS $ * DISPLAY THE RESULT *

As I will tokenize it, it should produce:

PRINT - token is KEYWORD
THE FLOAT IS - token is STRING_LITERAL
DISPLAY THE RESULT - token is COMMENT

I would like to know the most efficient way to obtain this. Note that I still have to validate the occurence of string literal and comment. (Ex. Check if it is properly enclosed). So far my way is to split each line by whitespaces and and when a lexeme contains a "$" or "*", I will validate the string literal. Here is my implementation:

private void getLexemes(){
    for(String line : newSourceCode){
        String[] lexemesInALine = line.trim().split("[\\s]+");
        for(String lexemeInALine : lexemesInALine){
            if(!(lexemeInALine.contains("$"))){
                lexemes.add(lexemeInALine);
                tempTokens.add(findToken(lexemeInALine));
                line = line.replaceFirst(lexemeInALine,"").trim();
            }else{
                validateStringType(line);
                break;
            }
        }

Thank you for the help.

Solution

I assume your language is deterministic and context-free? That means, you can't correctly parse it using regular expressions.

What you need is a state machine that works on a stream of tokens. Java comes with two classes that might work for you: StreamTokenizer and StringTokenizer.

But what you really want is to use one of the dozens parser generators. Maybe something like ANTLR. There are plenty described here:

https://en.wikipedia.org/wiki/Comparison_of_parser_generators

If all this fails, a finite state machine it is. Something along those lines

public class Parsy {
    enum State { string, comment, token }
    void parse(StringTokenizer tokenizer) {
        State state = State.token;

        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.nextToken();
            // figure out type of token
            if (token.length() == 1) {
                char delim = token.charAt(0);
                switch (delim) {
                    case '$':
                        switch (state) {
                            case token: {
                                // a string literal has started, emit what we have, start a string
                                printOut(tokens, state);
                                tokens.clear();
                                tokens.add(token);
                                state = State.string;
                                break;
                            }
                            case string: { // parsing a string, so this ends it
                                printOut(tokens, state);
                                tokens.clear();
                                state = State.token;
                                break;
                            }
                            case comment: { // $ is ignored since we are in a comment
                                tokens.add(token);
                                break;
                            }
                        }
                        break;
                    // ...
                }
            } else {
                // not a delimiter token
                tokens.add(token);
            }

        } // end of while
    if (state != State.token) {
       System.out.println("Oops! Syntax error. I'm still parsing" + state);
     }
    }
}