c++ regex token tokenize regex-group

Why is my code not matching regexes for these s-expressions?


I am trying to tokenize an S-expression and classify each token according to its type. I am pretty new to regex, so I am not sure why this only recognizes the parentheses and sends every other token to the else branch ("Variable name or string"). If you have any idea why my regexes aren't matching the tokens, please let me know!

#include <string>
#include <regex>
#include <iostream>
#include <vector>

#define print(var) std::cout << var << std::endl

std::string INT_REGEX = "\b[0-9]{1,3}[0-9]{1,3}\b",
            DOUBLE_REGEX = "\b[0-9]{1,3}.[0-9]{1,3}\b",
            BOOLEAN_REGEX = "^(true|false)$";

bool matchRegex(std::string pattern, std::string inputString) {
    std::regex expression(pattern);
    return std::regex_match(inputString, expression);
}

void detectTokenType(std::string strToken) {
        if (strToken == "(" || strToken == ")")
            print("Parenthesis");
        else if (matchRegex(INT_REGEX, strToken))
            print("Integer");
        else if (matchRegex(DOUBLE_REGEX, strToken))
            print("Double");
        else if (matchRegex(DOUBLE_REGEX, strToken))
            print("Boolean");
        else
            print("Variable name or string");
}

void tokenize(std::string listData) {
    std::vector<char> tokenBuffer;

    for (int i = 0; i < listData.length(); i++) {
        char currChar = listData[i];

        if (i == listData.length() - 1) {
            tokenBuffer.push_back(currChar);
            std::string strToken(tokenBuffer.begin(), tokenBuffer.end());
            detectTokenType(strToken);
        }
        else if (currChar != ' ') {
            tokenBuffer.push_back(currChar);
        }

        else {
            std::string strToken(tokenBuffer.begin(), tokenBuffer.end());
            tokenBuffer.clear();
            detectTokenType(strToken);
        }
    }
}


int main() {
    std::string codeSnippet = "( 2 3.0 true )";
    tokenize(codeSnippet);
    return 0;
}

Solution

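  • In a C++ string literal, \b is the escape sequence for the backspace character, so the regex engine never sees a word-boundary assertion. To pass \b through to the regex, you need to write \\b in the string. Similarly, . has a special meaning in a regex (it is a wildcard that matches any character). To match a literal dot, you need \\. in the string.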

    Also, INT_REGEX uses [0-9]{1,3}[0-9]{1,3}, which requires at least two digits, so a single-digit token such as 2 can never match. One [0-9]{1,3} is enough:

    std::string INT_REGEX = "\\b[0-9]{1,3}\\b",
                DOUBLE_REGEX = "\\b[0-9]{1,3}\\.[0-9]{1,3}\\b",
                BOOLEAN_REGEX = "^(true|false)$";
    

    Finally, the branch that prints "Boolean" tests DOUBLE_REGEX a second time, so it can never be reached with a match. It should test BOOLEAN_REGEX instead.

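Putting these fixes together, a minimal sketch of a corrected classifier might look like the following. It makes a few assumptions that are not in the original code: it uses raw string literals (C++11) instead of doubled backslashes, it drops the \b and ^...$ anchors because std::regex_match already requires the whole token to match, and it splits on whitespace with std::istringstream rather than the hand-written loop. The names classifyToken and classifyAll are hypothetical helpers introduced only for illustration.

```cpp
#include <regex>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not in the original post): returns the type name
// of a single token instead of printing it.
std::string classifyToken(const std::string& token) {
    // Raw string literals R"(...)" hand backslashes to the regex engine
    // unchanged, so \. really is "a literal dot".
    static const std::regex intRe(R"([0-9]{1,3})");
    static const std::regex doubleRe(R"([0-9]{1,3}\.[0-9]{1,3})");
    static const std::regex boolRe(R"(true|false)");

    if (token == "(" || token == ")") return "Parenthesis";
    // std::regex_match succeeds only when the entire token matches,
    // so no explicit anchors are needed here.
    if (std::regex_match(token, intRe)) return "Integer";
    if (std::regex_match(token, doubleRe)) return "Double";
    if (std::regex_match(token, boolRe)) return "Boolean";
    return "Variable name or string";
}

// Hypothetical helper: splits the input on whitespace and classifies
// each token in order. Assumes tokens are whitespace-separated, as in
// the question's sample input.
std::vector<std::string> classifyAll(const std::string& source) {
    std::istringstream stream(source);
    std::vector<std::string> kinds;
    std::string token;
    while (stream >> token) {
        kinds.push_back(classifyToken(token));
    }
    return kinds;
}
```

With the question's input "( 2 3.0 true )", classifyAll should yield Parenthesis, Integer, Double, Boolean, Parenthesis. Compiling the regex objects once as function-local statics also avoids rebuilding a std::regex on every call, which the original matchRegex does.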