Tokenize a String and Keep Delimiters Using Regular Expression in C++


I would like to modify the regular expression given below so that it produces the list of matches shown under "Expected Matches".

I want to use a regular expression to match a set of 'tokens'. Specifically, I want &&, ||, ;, ( and ) each to be matched as a token, and any run of text that contains none of those tokens should also be a match. The problem I am having is distinguishing between one pipe and two pipes: a single | should stay inside the surrounding text, while || should be a token of its own. How can I produce the desired matches? Thank you very much for your help!

The expression:

((&{2})|(\|{2})|(\()|(\))|(;)|[^&|;()]+)

Test String

a < b | c | d > e >> f && ((g) || h) ; i

Expected Matches

a < b | c | d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

Actual Matches

a < b 
|
 c 
|
 d > e >> f 
&&

(
(
g
)

||
 h
)

;
 i

I am trying to implement a custom tokenizer for a program in C++.

Example Code

std::vector<std::string> Parser::tokenizeInput(std::string s) {
    std::vector<std::string> returnTokens;

    //tokenize correctly using this regex
    std::regex rgx(R"S(((&{2})|(\|{2})|(\()|(\))|(;)|[^&|;()]+))S");

    std::regex_iterator<std::string::iterator> rit ( s.begin(), s.end(), rgx );
    std::regex_iterator<std::string::iterator> rend;

    while (rit != rend) {

        std::string tokenStr = rit->str();

        //trim first, so that whitespace-only tokens of any length are dropped
        boost::algorithm::trim(tokenStr);

        if (!tokenStr.empty()) {
            //the token is not blank, so push it
            returnTokens.push_back(tokenStr);
        }

        ++rit;
    }

    return returnTokens;
}

Example Driver Code

//in main
std::vector<std::string> testVec = Parser::tokenizeInput(inputWithNoComments);
std::cout << "input string: " << inputWithNoComments << std::endl;
std::cout << "tokenized string[";
for(unsigned int i = 0; i < testVec.size(); i++){
    std::cout << testVec[i];
    if ( i + 1 < testVec.size() ) { std::cout << ", "; }
}
std::cout << "]" << std::endl;

Produced Output

input string: (cat file > outFile) || ( ls -l | grep -i )
tokenized string[(, cat file > outFile, ), ||, (, ls -l, grep -i, )]

input string: a && b || c > d >> e < f | g
tokenized string[a, &&, b, ||, c > d >> e < f, g]

input string: foo | bar || foo || bar | foo | bar
tokenized string[foo, bar, ||, foo, ||, bar, foo, bar]

What I Want the Output to be

input string: (cat file > outFile) || ( ls -l | grep -i )
tokenized string[(, cat file > outFile, ), ||, (, ls -l | grep -i, )]

input string: a && b || c > d >> e < f | g
tokenized string[a, &&, b, ||, c > d >> e < f | g]

input string: foo | bar || foo || bar | foo | bar
tokenized string[foo | bar, ||, foo, ||, bar | foo | bar]
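
A minimal reproduction of the behaviour, as a sketch: the helper name originalMatches is hypothetical, and the regex is the one from the example code above. The character class [^&|;()] excludes every |, so a lone pipe matches none of the alternatives and the iterator skips over it.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every match of the regex from the question.
std::vector<std::string> originalMatches(const std::string& s) {
    static const std::regex rgx(R"S(((&{2})|(\|{2})|(\()|(\))|(;)|[^&|;()]+))S");
    std::vector<std::string> out;
    // A single '|' fits no alternative, so the search moves past it.
    for (std::sregex_iterator it(s.begin(), s.end(), rgx), end; it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

For the input "a | b" this returns "a " and " b"; the pipe itself is never part of any match, which is why it disappears from the produced output.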

Solution

  • I suggest a splitting approach: pass {-1, 0} to std::sregex_token_iterator so that it yields both the non-matched substrings (-1, the text between delimiters) and the matched ones (0, the delimiters themselves), and use a much simpler regex, &&|\|\||[;()]. Then discard the empty substrings, which appear wherever two matches are adjacent or a match sits at the very start of the string:

    std::regex rx(R"(&&|\|\||[();])");
    std::string exp = "a < b | c | d > e >> f && ((g) || h) ; i";
    std::sregex_token_iterator srti(exp.begin(), exp.end(), rx, {-1, 0});
    std::vector<std::string> tokens;
    std::remove_copy_if(srti, std::sregex_token_iterator(), 
                    std::back_inserter(tokens),
                    [](std::string const &s) { return s.empty(); });
    for( auto & p : tokens ) std::cout <<"'"<< p <<"'"<< std::endl;
    

    Output (only empty strings are discarded, so whitespace-only tokens such as ' ' remain):

    'a < b | c | d > e >> f '
    '&&'
    ' '
    '('
    '('
    'g'
    ')'
    ' '
    '||'
    ' h'
    ')'
    ' '
    ';'
    ' i'
    

    Special credit for the empty string removal code goes to Jerry Coffin.
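
Putting the pieces together, here is a self-contained sketch of the splitting approach that also trims each token, so that the result matches the output requested in the question. The names tokenize and trimCopy are hypothetical, and trimCopy is only a standard-library stand-in for boost::algorithm::trim.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns s without leading and trailing whitespace.
static std::string trimCopy(const std::string& s) {
    const char* ws = " \t\n\r\f\v";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";  // empty or all whitespace
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Splits on &&, ||, ;, ( and ) while keeping those delimiters as tokens.
std::vector<std::string> tokenize(const std::string& input) {
    static const std::regex delims(R"(&&|\|\||[();])");
    std::vector<std::string> tokens;
    // -1 selects the text between matches, 0 the matched delimiter itself.
    std::sregex_token_iterator it(input.begin(), input.end(), delims, {-1, 0});
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        std::string tok = trimCopy(it->str());
        if (!tok.empty()) tokens.push_back(tok);  // drop empty and blank pieces
    }
    return tokens;
}
```

With this sketch, tokenize("foo | bar || foo || bar | foo | bar") gives foo | bar, ||, foo, ||, bar | foo | bar. A single pipe is never a delimiter here, so it simply stays inside the surrounding text.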