I am parsing a text file with Boost.Regex in C++ and looking for '\' characters in it. The file also contains Unicode escapes that start with '\u'. Is there a way to tell a plain '\' apart from a '\u' sequence? Here is the content of test.txt, the file I am parsing:
"ID": "\u01FE234DA - this is id ",
"speed": "96\/78",
"avg": "\u01FE234DA avg\83"
Here is my attempt:
#include <boost/regex.hpp>
#include <cstdlib>
#include <string>
#include <iostream>
#include <fstream>
using namespace std;
const int BUFSIZE = 500;
int main(int argc, char** argv) {
    if (argc < 2) {
        cout << "Pass the input file" << endl;
        exit(0);
    }
    boost::regex re("\\\\+");
    string file(argv[1]);
    char buf[BUFSIZE];
    boost::regex uni("\\\\u+");
    ifstream in(file.c_str());
    while (!in.eof())
    {
        in.getline(buf, BUFSIZE-1);
        if (boost::regex_search(buf, re))
        {
            cout << buf << endl;
            cout << "(\\) found" << endl;
            if (boost::regex_search(buf, uni)) {
                cout << buf << endl;
                cout << "unicode found" << endl;
            }
        }
    }
}
When I run the code above, it prints the following:
"ID": "\u01FE234DA - this is id ",
(\) found
"ID": "\u01FE234DA - this is id ",
unicode found
"speed": "96\/78",
(\) found
"avg": "\u01FE234DA avg\83"
(\) found
"avg": "\u01FE234DA avg\83"
unicode found
Instead, I want the following:
"ID": "\u01FE234DA - this is id ",
unicode found
"speed": "96\/78",
(\) found
"avg": "\u01FE234DA avg\83"
(\) and unicode found
I think the code cannot distinguish '\' from '\u', but I am not sure what to change.
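Your first regex matches every backslash, including the one that starts '\u', so it fires on the unicode lines too. Try using [^u] in your first regex so that it only matches a backslash followed by a character other than u.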
boost::regex re("\\\\[^u]"); // matches \ not followed by u
boost::regex uni("\\\\u"); // matches \u
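As a rough sketch of how those two patterns could be combined to produce the three messages you want: the snippet below uses std::regex as a stand-in for boost::regex so that it compiles without Boost (the patterns and the regex_search calls have the same shape in both), and the function name classify is made up for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the message for one input line.
// Assumption: std::regex is used here in place of boost::regex.
std::string classify(const std::string& line) {
    static const std::regex plain("\\\\[^u]"); // backslash followed by a non-'u' character
    static const std::regex uni("\\\\u");      // backslash followed by 'u'

    const bool has_plain = std::regex_search(line, plain);
    const bool has_uni = std::regex_search(line, uni);

    if (has_plain && has_uni) return "(\\) and unicode found";
    if (has_uni) return "unicode found";
    if (has_plain) return "(\\) found";
    return ""; // no backslash on this line
}
```

Calling classify on each line read from the file and printing the line plus the returned message (when it is not empty) should give the output listed in the question. One caveat: [^u] needs a character after the backslash, so a backslash at the very end of a line is not reported.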
It's probably best to use a single regex.
boost::regex re("\\\\(u)?"); // matches \ with or without u
Then check whether the first capture group, m[1], matched the 'u':
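boost::cmatch m; // match results for a char buffer
if (boost::regex_search(buf, m, re) && m[1].matched) {
    // unicode: the first backslash found is followed by 'u'
}
else {
    // not unicode: no backslash, or the first one is not followed by 'u'
}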
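Keep in mind that regex_search stops at the first match, so a line that contains both kinds of backslash needs every match to be visited. A minimal sketch of that, again assuming std::regex in place of boost::regex (Boost offers an equivalent iterator), with BackslashCounts and count_backslashes being names invented for this example:

```cpp
#include <regex>
#include <string>

// Hypothetical result type: how many backslashes of each kind a line has.
struct BackslashCounts {
    int unicode = 0; // backslash followed by 'u'
    int plain = 0;   // any other backslash
};

// Assumption: std::regex is used here in place of boost::regex.
BackslashCounts count_backslashes(const std::string& line) {
    static const std::regex re("\\\\(u)?"); // backslash, optionally followed by 'u'
    BackslashCounts counts;
    // Visit every match in the line, not just the first one.
    for (std::sregex_iterator it(line.begin(), line.end(), re), end; it != end; ++it) {
        if ((*it)[1].matched) {
            ++counts.unicode; // capture group 1 caught the 'u'
        } else {
            ++counts.plain;
        }
    }
    return counts;
}
```

For the "avg" line from the question this gives one unicode and one plain backslash, so you can print "(\) and unicode found" when both counts are non-zero.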
Regexes are a good fit for this kind of pattern matching. They can look more complex at first, but once you get used to them they are easier to maintain and less bug-prone than iterating over a string one character at a time.