c++ regex token

Escaping single quotes (\') in a regex that extracts the string between two single quotes


I have the following string:

std::string s("server ('m1.labs.teradata.com') username ('use\\')r_*5') password('u\" er 5') dbname ('default')");

I have used the following code:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Matches a single-quoted literal that may contain backslash-escaped chars.
    std::regex re(R"('[^'\\]*(?:\\[\s\S][^'\\]*)*')");
    std::string s("server ('m1.labs.teradata.com') username ('use\\')r_*5') password('u\" er 5') dbname ('default')");
    unsigned count = 0;
    for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), re);
         i != std::sregex_iterator();
         ++i)
    {
        std::smatch m = *i;
        std::cout << "the token is" << "   " << m.str() << std::endl;
        count++;
    }
    std::cout << "There were " << count << " tokens found." << std::endl;
    return 0;
}

The output for the above string is:

the token is   'm1.labs.teradata.com'
the token is   'use\')r_*5'
the token is   'u" er 5'
the token is   'default'
There were 4 tokens found.

Now, if the string s in the code above is changed to

std::string s("server ('m1.labs.ter\'adata.com') username ('use\\')r_*5') password('u\" er 5') dbname ('default')");

The output becomes:

the token is   'm1.labs.ter'
the token is   ') username ('
the token is   ')r_*5'
the token is   'u" er 5'
the token is   'default'
There were 5 tokens found.

Now the output differs between the two strings. The expected output, in both cases, is everything enclosed between (' and '), i.e.:

the token is   'm1.labs.teradata.com'
the token is   'use\')r_*5'
the token is   'u" er 5'
the token is   'default'
There were 4 tokens found

The regex in the code extracts the tokens properly, but it does not handle an escaped single quote (\'). It copes with ", ) and similar characters, but not with the single quote. Can the regex be modified to produce the output I need? Thanks in advance.


Solution

  • The regex you are using, which I shared in a comment yesterday, is correct. It matches single-quoted string literals that may contain escaped single quotes.

    // Group 1 captures the text between the quotes, without the quotes themselves.
    std::regex re(R"('([^'\\]*(?:\\[\s\S][^'\\]*)*)')");
    std::string s("server ('m1.labs.teradata.com') username ('u\\'se)r_*5') password('uer 5') dbname ('default')");
    unsigned count = 0;
    for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), re);
         i != std::sregex_iterator();
         ++i)
    {
        std::smatch m = *i;
        std::cout << "the token is" << "   " << m.str(1) << std::endl;
        count++;
    }
    std::cout << "There were " << count << " tokens found." << std::endl;
    

    Here is my C++ demo

    Note that the text ('u\'se)r_*5') must reach the regex with its backslash intact. In a regular string literal, escape sequences are processed, so \' on its own yields just ' (this is why the second string in the question loses its backslash). A literal backslash has to be written as \\:

    "('u\\'se)r_*5')"
    

    or with a raw string literal where backslashes denote literal backslashes:

    R"(('u\'se)r_*5'))"
    

    The R"(...)" syntax forms a raw string literal.

    Pattern details:

    • ' - a single quote
    • [^'\\]* - 0+ chars other than single quote and backslash
    • (?:\\[\s\S][^'\\]*)* - zero or more sequences of:
      • \\[\s\S] - any backslash-escaped char
      • [^'\\]* - 0+ chars other than ' and \
    • ' - a single quote.

    Note that to avoid treating an escaped single quote as the opening quote of a literal, you need to tweak the expression as in this snippet:

    // The prefix only lets a match start at a quote preceded by an even number of backslashes.
    std::regex re(R"((?:^|[^\\])(?:\\{2})*'([^'\\]*(?:\\[\s\S][^'\\]*)*)')");
    std::string s("server ('m1.labs.teradata.com') username ('u\\'se)r_*5') password('uer 5') dbname ('default')");
    unsigned count = 0;
    for (std::sregex_iterator i = std::sregex_iterator(s.begin(), s.end(), re);
         i != std::sregex_iterator();
         ++i)
    {
        std::smatch m = *i;
        std::cout << "the token is" << "   " << m.str(1) << std::endl;
        count++;
    }
    std::cout << "There were " << count << " tokens found." << std::endl;
    

    The (?:^|[^\\])(?:\\{2})* prefix matches the start of the string or any char other than \, followed by 0+ pairs of backslashes, so an escaped ' is never taken as the opening quote.

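    And finally, if you just need to get a list of matches into a vector, you may use the program below. Note that its pattern treats a doubled single quote ('') as the escaped quote rather than \':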

    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>
    
    using namespace std;
    
    int main() {
        std::regex rx("'[^']*(?:''[^']*)*'");
        std::string sentence("server ('m1.labs.\\''tera\"da  ta.com') username ('us *(er'')5') password('uer 5') dbname ('default')");
        std::vector<std::string> names(std::sregex_token_iterator(sentence.begin(), sentence.end(), rx),
                                   std::sregex_token_iterator());
    
        for( auto & p : names ) cout << p << endl;
        return 0;
    }
    

    See the C++ demo.