c++ regex boost basic compiler-construction

Do not match if a char is between quotation marks (i.e. inside a programming string literal)


I have been assigned to write a compiler for the BASIC programming language. In BASIC, statements are separated by new lines or by a : mark. For example, the following two code samples are both valid.
Model# 1

 10 PRINT "Hello World 1" : PRINT "Hello World 2"

Model# 2

 10 PRINT "Hello World 1"
 20 PRINT "Hello World 2"

You can test those here.
The first thing I need to do, before parsing code in my compiler, is to split it into statements.
I have already split the code into lines, but I am stuck finding a regex to split the following code sample.
It should be split into 2 PRINT statements:

 10 PRINT "Hello World 1" : PRINT "Hello World 2"

But it should NOT match this one.
The following code sample is a single standalone command:

 10 PRINT "Hello World 1" ": PRINT Hello World 2"

Question

Is there any regex pattern that DOES match the first of the above code samples, where the : is outside of a pair of ", and does NOT match the second one?

Can anybody help me out here?
Anything would help. :)
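
As a rough sketch of what such a pattern could look like: instead of splitting on :, one option is to match each statement as a run of pieces that are either a character other than " and :, or a complete quoted string. The example below uses std::regex rather than Boost, and the function name split_with_regex is made up for illustration. It assumes the quotes on a line are balanced, and it leaves the surrounding whitespace on each piece.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only.
// Returns every maximal run that contains no ':' outside a quoted string.
std::vector<std::string> split_with_regex(const std::string& line)
{
    // One statement = one or more of:
    //   [^":]     a character that is neither a quote nor a colon
    //   "[^"]*"   a complete double-quoted string (may contain ':')
    static const std::regex statement(R"re((?:[^":]|"[^"]*")+)re");

    std::vector<std::string> statements;
    for (std::sregex_iterator it(line.begin(), line.end(), statement), end;
         it != end; ++it)
    {
        // Pieces are not trimmed here; trim them before further parsing.
        statements.push_back(it->str());
    }
    return statements;
}
```

For the first sample line this yields two pieces, and for the second sample line a single piece, because the : there sits inside a quoted string.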


Solution

  • Thanks to @Mauren I managed to do what I wanted to do.
    Here is my code (maybe it will help someone later).
    Note that the source file's content is held in char* buffer, and the resulting lines are stored in vector<string> source_codes.

        /* lines' tokens container */
        std::string token;
        /* Tokenize the file's content into separate lines */
        /* fetch the data line by line and store each line in the container vector */
        /* measure the buffer once instead of calling strlen on every iteration */
        const int length = static_cast<int>(strlen(buffer));
        /* run one step past the end so that a last line without '\n' is not lost */
        for(int top = 0, bottom = 0; top <= length ; top++)
        {
            /* split at line breaks */
            if(top < length && (buffer[top] != '\n' || top == bottom))
            { /* collect current line's chars */ token += buffer[top]; /* continue seeking */ continue; }
            /* if we reach here we have collected the current line's tokens */
            /* normalize current tokens */
            boost::algorithm::trim(token);
            /* check whether the line may hold several statements separated by `:` */
            if(token.find(':') != std::string::npos)
            {
                /* a quotation mark encounter flag */
                bool quotation_meet = false;
                /* process entire line from beginning; stop at the end of the token */
                for(std::size_t index = 0; index < token.length() ; index++)
                {
                    /* fetch currently processing char */
                    char _char = token[index];
                    /* if we encounter a quotation mark */
                    /* we are entering or leaving a string */
                    /* note that in BASIC a quotation mark is printed with `CHR$(34)`,
                     * so there is no `\"` to worry about! :) */
                    if(_char == '"')
                    {
                        /* change quotation meeting flag */
                        quotation_meet = !quotation_meet;
                        /* proceed with other chars. */
                        continue;
                    }
                    /* if we have met the `:` char and we are not inside a pair of quotation marks */
                    if(_char == ':' && !quotation_meet)
                    {
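                        /* this is the first sub-token of current token */
                        /* `substr` takes a start and a length, so this copies everything before the `:` */
                        std::string subtoken(token.substr(0, index));
                        /* normalize the sub-token */
                        boost::algorithm::trim(subtoken);
                        /* add sub-token as new line, unless nothing precedes the `:` */
                        if(subtoken.length())
                            source_codes.push_back(subtoken);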
                        /* the rest of the line becomes the new token */
                        /**
                         * Note: We keep the `:` mark intentionally. Every code line in BASIC
                         * should start with a number, so a line starting with `:` marks a statement
                         * that is meant to execute semi-concurrently with the previous numbered statement.
                         * So we use the following `substr` call instead of `token.substr(index + 1);`
                         */
                        token = token.substr(index);
                        /* normalize the new token */
                        boost::algorithm::trim(token);
                        /* reset the index for the new token; the loop's increment then skips the kept `:` */
                        index = 0;
                        /* continue with other chars */
                        continue;
                    }
                }
                /* whatever remains in the token is added below */
            }
            /* if the token is not empty */
            if(token.length())
                /* add the fetched token to our source code */
                source_codes.push_back(token);
            /* move pointer to next tokens' position */
            bottom = top + 1;
            /* clear the token buffer */
            token.clear();
            /* a fail safe for loop */
            continue;
        }
        /* We NOW have our source code split into lines and saved in a vector */
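
The same quote-tracking idea can be packaged as a small standalone function. This is a minimal sketch, not the code above: the names trim and split_statements are hypothetical, it uses only the standard library instead of Boost, and unlike the loop above it drops the : separator and skips empty statements.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: strips leading and trailing whitespace.
std::string trim(const std::string& s)
{
    const char* whitespace = " \t\r\n";
    const std::string::size_type first = s.find_first_not_of(whitespace);
    if (first == std::string::npos) { return ""; }
    const std::string::size_type last = s.find_last_not_of(whitespace);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: splits one BASIC line into statements at every ':'
// that lies outside a double-quoted string.
std::vector<std::string> split_statements(const std::string& line)
{
    std::vector<std::string> statements;
    std::string current;
    bool in_string = false;  // true while between a pair of '"'

    for (char c : line) {
        if (c == '"') {
            // No \" escape is assumed, so a plain toggle is enough.
            in_string = !in_string;
        }
        if (c == ':' && !in_string) {
            // End of a statement: store it (if not empty) and start a new one.
            const std::string statement = trim(current);
            if (!statement.empty()) { statements.push_back(statement); }
            current.clear();
            continue;
        }
        current += c;
    }

    // Store whatever follows the last separator.
    const std::string statement = trim(current);
    if (!statement.empty()) { statements.push_back(statement); }
    return statements;
}
```

Calling split_statements on the first sample line from the question returns two statements, while the second sample line comes back as a single statement because its : is inside a string.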