Search code examples
javaregexstringsplitcompiler-construction

Split string using custom regex java


I am building a compiler. Some of the specifications of this are the following:

  • String literals are enclosed by dollar sign ("$") - eg. $ string sample $
  • Comments are enclosed by "*" - eg. * sample comment *
  • Comments could exist anywhere execept between operations - eg. 4 + * sample comment * 5 - (this is not allowed)

Now I have to split a source code line to tokenize it. Example case:

PRINT $ THE FLOAT IS $ * DISPLAY THE RESULT *

As I will tokenize it, it should produce:

PRINT - token is KEYWORD
THE FLOAT IS - token is STRING_LITERAL
DISPLAY THE RESULT - token is COMMENT 

I would like to know the most efficient way to obtain this. Note that I still have to validate the occurence of string literal and comment. (Ex. Check if it is properly enclosed). So far my way is to split each line by whitespaces and and when a lexeme contains a "$" or "*", I will validate the string literal. Here is my implementation:

private void getLexemes(){
    for(String line : newSourceCode){
        String[] lexemesInALine = line.trim().split("[\\s]+");
        for(String lexemeInALine : lexemesInALine){
            if(!(lexemeInALine.contains("$"))){
                lexemes.add(lexemeInALine);
                tempTokens.add(findToken(lexemeInALine));
                line = line.replaceFirst(lexemeInALine,"").trim();
            }else{
                validateStringType(line);
                break;
            }
        }

Thank you for the help.


Solution

  • I assume your language is deterministic and context-free? That means, you can't correctly parse it using regular expressions.

    What you need is a state machine that works on a stream of tokens. Java comes with two classes that might work for you: StreamTokenizer and StringTokenizer.

    But what you really want is to use one of the dozens parser generators. Maybe something like ANTLR. There are plenty described here:

    https://en.wikipedia.org/wiki/Comparison_of_parser_generators

    If all this fails, a finite state machine it is. Something along those lines

    public class Parsy {
        enum State { string, comment, token }
        void parse(StringTokenizer tokenizer) {
            State state = State.token;
    
            List<String> tokens = new ArrayList<>();
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                // figure out type of token
                if (token.length() == 1) {
                    char delim = token.charAt(0);
                    switch (delim) {
                        case '$':
                            switch (state) {
                                case token: {
                                    // a string literal has started, emit what we have, start a string
                                    printOut(tokens, state);
                                    tokens.clear();
                                    tokens.add(token);
                                    state = State.string;
                                    break;
                                }
                                case string: { // parsing a string, so this ends it
                                    printOut(tokens, state);
                                    tokens.clear();
                                    state = State.token;
                                    break;
                                }
                                case comment: { // $ is ignored since we are in a comment
                                    tokens.add(token);
                                    break;
                                }
                            }
                            break;
                        // ...
                    }
                } else {
                    // not a delimiter token
                    tokens.add(token);
                }
    
            } // end of while
        if (state != State.token) {
           System.out.println("Oops! Syntax error. I'm still parsing" + state);
         }
        }
    }