I want to remove the emojis from a JSON Telegram bot update parsed with Boost.PropertyTree.
I tried the regex pattern from this answer and a few others, but I'm not sure how to get them to work in C++ (the code below crashes): https://stackoverflow.com/a/24674266/2212021
"message":{
"message_id":123,
"from":{
"id":12345,
"first_name":"name",
"username":"username"
},
"chat":{
"id":12345,
"first_name":"name",
"username":"username",
"type":"private"
},
"date":1478144459,
"text":"this is \ud83d\udca9 a sentence"
}
BOOST_FOREACH(const boost::property_tree::ptree::value_type& child, jsontree.get_child("result"))
{
std::string message(child.second.get<std::string>("message.text").c_str());
boost::regex exp("/[\u{1F600}-\u{1F6FF}]/");
std::string message_clean = regex_replace(message, exp, "");
...
}
Exception thrown at 0x00007FF87C1C7788 in CrysisWarsDedicatedServer.exe: Microsoft C++ exception: boost::exception_detail::clone_impl > at memory location 0x000000001003F138.
Unhandled exception at 0x00007FF87C1C7788 in CrysisWarsDedicatedServer.exe: Microsoft C++ exception: boost::exception_detail::clone_impl > at memory location 0x000000001003F138.
The first problem is using .c_str() on a byte array with arbitrary text encoding. There's no need, so don't do it.
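Next, '\u{1F600}' is not a valid C++ character escape: before C++23, '\u' must be followed by exactly four hexadecimal digits. Did you mean '\\u', so that the backslash itself ends up in the pattern?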
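To make the two points above concrete, here is a minimal standard-library sketch. The function names are made up for illustration and are not part of Boost or of the question's code. It shows that a round trip through .c_str() is at best a redundant copy (and truncates at the first NUL byte), and that a regex escape such as \p{So} needs its backslash doubled in an ordinary string literal, or can be written unchanged in a raw string literal.

```cpp
#include <string>

// Hypothetical helper: copies a string the way the question does,
// through a NUL-terminated C string. Everything after an embedded
// '\0' is lost, and nothing is gained when there is none.
std::string round_trip_through_c_str(const std::string& s) {
    return std::string(s.c_str());
}

// The regex text \p{So}, written with a doubled backslash so the
// compiler stores one literal backslash in the string.
std::string escaped_pattern() {
    return "\\p{So}";
}

// The same regex text as a raw string literal: no escaping needed.
std::string raw_pattern() {
    return R"(\p{So})";
}
```

Both pattern functions return the same six characters, so either spelling can be handed to a regex constructor.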
Finally, make sure Boost Regex is compiled with Unicode support and use the appropriate functions.
After spending some time with those documentation pages, I came up with:
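//#define BOOST_HAS_ICU
#include <boost/property_tree/json_parser.hpp>
#include <boost/regex.hpp>
#include <boost/regex/icu.hpp>   // u32regex, make_u32regex, u32regex_replace
#include <unicode/bytestream.h>  // icu::StringByteSink
#include <unicode/unistr.h>      // icu::UnicodeString
#include <iostream>
#include <sstream>
#include <string>

// Convert an ICU string back to UTF-8 for output.
std::string asUtf8(icu::UnicodeString const& ustr);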
std::string sample = R"(
{
"message":{
"message_id":123,
"from":{
"id":12345,
"first_name":"name",
"username":"username"
},
"chat":{
"id":12345,
"first_name":"name",
"username":"username",
"type":"private"
},
"date":1478144459,
"text":"this is \ud83d\udca9 a sentence"
}
}
)";
int main() {
    boost::property_tree::ptree pt;
    {
        std::istringstream iss(sample);
        read_json(iss, pt);
    }

    // No .c_str() needed: get() with a string default already yields a std::string.
    auto umessage = icu::UnicodeString::fromUTF8(pt.get("message.text", ""));

    // \p{So} matches the Unicode general category "Symbol, other".
    boost::u32regex exp = boost::make_u32regex("\\p{So}");
    auto clean = boost::u32regex_replace(umessage, exp, icu::UnicodeString::fromUTF8("<symbol>"));

    std::cout << asUtf8(clean) << "\n";
}

std::string asUtf8(icu::UnicodeString const& ustr) {
    std::string r;
    {
        icu::StringByteSink<std::string> bs(&r);
        ustr.toUTF8(bs);
    }
    return r;
}
This prints:
this is <symbol> a sentence
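If building Boost.Regex with ICU is not an option, a rough alternative is to walk the UTF-8 bytes directly and drop code points in chosen ranges. The sketch below is not the approach used above, and every name in it is hypothetical. It assumes the message text is well-formed UTF-8, and the ranges are an illustrative, incomplete choice rather than a definition of "emoji". It also shows why the range from the question cannot match the sample: the JSON escape \ud83d\udca9 is the surrogate pair for U+1F4A9, which lies outside 1F600-1F6FF.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Combine a UTF-16 surrogate pair (as written in JSON \u escapes)
// into a single code point.
std::uint32_t combine_surrogates(std::uint32_t high, std::uint32_t low) {
    return 0x10000u + ((high - 0xD800u) << 10) + (low - 0xDC00u);
}

// Illustrative ranges only: a few symbol and pictograph blocks.
bool in_stripped_range(std::uint32_t cp) {
    return (cp >= 0x1F300u && cp <= 0x1FAFFu) ||
           (cp >= 0x2600u && cp <= 0x27BFu);
}

// Copy `in` to the result, skipping code points in the ranges above.
// Assumes well-formed UTF-8; stray or truncated bytes are copied as-is.
std::string strip_emoji_utf8(const std::string& in) {
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        std::uint32_t cp = lead;
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07u; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0Fu; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1Fu; }
        if (i + len > in.size()) {  // truncated sequence: copy one byte
            len = 1;
            cp = lead;
        }
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3Fu);
        }
        if (!(len > 1 && in_stripped_range(cp))) {
            out.append(in, i, len);
        }
        i += len;
    }
    return out;
}
```

Called on the decoded sample text, strip_emoji_utf8 leaves "this is  a sentence" (with the two surrounding spaces intact). A regex with Unicode properties, as above, remains the more general option because it does not rely on a hand-maintained range list.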