
How to distinguish the escape character '\' from the unicode escape '\u' using boost regex in C++


I am parsing a text file using boost regex in C++ and looking for '\' characters in it. The file also contains some unicode '\u' escapes. Is there a way to tell a plain '\' apart from '\u'? The following is the content of test.txt, the file I am parsing:

"ID": "\u01FE234DA - this is id ",
"speed": "96\/78",
"avg": "\u01FE234DA avg\83"

Here is my attempt:

#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <fstream>

using namespace std;
const int BUFSIZE = 500;

int main(int argc, char** argv) {

    if (argc < 2) {
        cout << "Pass the input file" << endl;
        exit(0);
    }

   boost::regex re("\\\\+");
   string file(argv[1]);
   char buf[BUFSIZE];

   boost::regex uni("\\\\u+");


   ifstream in(file.c_str());
   while (in.getline(buf, BUFSIZE-1))
   {
      if (boost::regex_search(buf, re))
      {
          cout << buf << endl;
          cout << "(\\) found" << endl;
          if (boost::regex_search(buf, uni)) {
              cout << buf << endl;
              cout << "unicode found" << endl;

          }

      }

   }
}

When I run the above code, it prints the following:

"ID": "\u01FE234DA - this is id ",
(\) found
"ID": "\u01FE234DA - this is id ",
unicode found
"speed": "96\/78",
(\) found
"avg": "\u01FE234DA avg\83"
(\) found
"avg": "\u01FE234DA avg\83"
unicode found

Instead, I want the following:

"ID": "\u01FE234DA - this is id ",
unicode found
"speed": "96\/78",
(\) found
"avg": "\u01FE234DA avg\83"
(\) and unicode found

I think the code cannot distinguish '\' from '\u', but I am not sure what to change or where.


Solution

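  • Your first regex matches any run of backslashes, including the one that starts \u, so every unicode line also reports "(\) found". Try using [^u] in the first regex so that the backslash must be followed by a character other than u.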

    boost::regex re("\\\\[^u]");  // matches \ not followed by u
    boost::regex uni("\\\\u");  // matches \u
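
One caveat with `[^u]`: it has to consume a character after the backslash, so a backslash at the very end of a line would not match. A negative lookahead, `\\\\(?!u)`, avoids that. The sketch below is a minimal illustration rather than a drop-in fix. `classify_line` is a hypothetical helper, and it uses `std::regex` in place of `boost::regex` so that it compiles without Boost. The same two patterns should carry over to `boost::regex` and `boost::regex_search`, since Boost's default Perl syntax also supports lookahead.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the message describing which kinds of
// backslash sequences appear in one line (empty string if there are none).
std::string classify_line(const std::string& line) {
    // Backslash NOT followed by 'u'. The lookahead consumes nothing,
    // so a backslash at the very end of the line still matches.
    static const std::regex plain("\\\\(?!u)");
    // Backslash immediately followed by 'u'.
    static const std::regex uni("\\\\u");

    const bool has_plain = std::regex_search(line, plain);
    const bool has_uni = std::regex_search(line, uni);

    if (has_plain && has_uni) return "(\\) and unicode found";
    if (has_uni) return "unicode found";
    if (has_plain) return "(\\) found";
    return "";
}
```

Calling `classify_line` on each line read from the file and printing the line followed by the returned message (when it is non-empty) gives one message per line. The sketch does not treat an escaped backslash (`\\` in the file) specially.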
    

    It is probably better to use a single regular expression:

    boost::regex re("\\\\(u)?"); // matches \ with or without u
    

    Then check whether the capture group m[1] took part in the match:

    boost::cmatch m;
    if (boost::regex_search(buf, m, re)) {
        if (m[1].matched) {
            // the first backslash on the line starts \u
        }
        else {
            // the first backslash on the line is a plain '\'
        }
    }
    

    Regular expressions are a good fit for this kind of pattern matching. They look more complex at first, but once you are used to them they are easier to maintain and less bug-prone than iterating over a string one character at a time.
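
A single `regex_search` call reports only the first backslash on a line, so a line that contains both `\u` and a plain `\` needs every match to be visited. The following is a minimal sketch of that idea, assuming `std::regex` as a stand-in for `boost::regex` so that it builds without Boost (Boost offers `boost::sregex_iterator` for the same purpose). `EscapeKinds` and `scan_escapes` are hypothetical names introduced for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical result type: which kinds of backslash sequences were seen.
struct EscapeKinds {
    bool plain = false;    // a backslash not followed by 'u'
    bool unicode = false;  // a backslash followed by 'u'
};

// Visits every match of \\(u)? in the line. Capture group 1 takes part
// in a match only when the backslash is followed by 'u'.
EscapeKinds scan_escapes(const std::string& line) {
    static const std::regex re("\\\\(u)?");
    EscapeKinds kinds;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(line.begin(), line.end(), re); it != end; ++it) {
        if ((*it)[1].matched) {
            kinds.unicode = true;
        } else {
            kinds.plain = true;
        }
    }
    return kinds;
}
```

With both flags available, the caller can print "unicode found", "(\) found", or "(\) and unicode found" once per line instead of once per regex.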