boost unicode c++03 boost-regex boost-propertytree

Strip emojis from Telegram bot update


I want to remove the emojis from a JSON Telegram bot update parsed with Boost property tree.

I tried to use the regex pattern from this answer and a few others, but I'm not sure how to get them to work in C++ (the code below crashes): https://stackoverflow.com/a/24674266/2212021

"message":{
   "message_id":123,
   "from":{
      "id":12345,
      "first_name":"name",
      "username":"username"
   },
   "chat":{
      "id":12345,
      "first_name":"name",
      "username":"username",
      "type":"private"
   },
   "date":1478144459,
   "text":"this is \ud83d\udca9 a sentence"
}

BOOST_FOREACH(const boost::property_tree::ptree::value_type& child, jsontree.get_child("result"))
{
    std::string message(child.second.get<std::string>("message.text").c_str());

    boost::regex exp("/[\u{1F600}-\u{1F6FF}]/");
    std::string message_clean = regex_replace(message, exp, "");

    ...
}

Exception thrown at 0x00007FF87C1C7788 in CrysisWarsDedicatedServer.exe: Microsoft C++ exception: boost::exception_detail::clone_impl > at memory location 0x000000001003F138.
Unhandled exception at 0x00007FF87C1C7788 in CrysisWarsDedicatedServer.exe: Microsoft C++ exception: boost::exception_detail::clone_impl > at memory location 0x000000001003F138.


Solution

  • The first problem is using .c_str() on a byte array with arbitrary text encoding. There's no need, so don't do it.

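    Next, '\u{...}' is not a valid C++03 character escape: a universal character name is '\u' followed by exactly four hex digits (or '\U' followed by eight), with no braces. Did you mean '\\u'? Note also that the leading and trailing '/' are JavaScript regex-literal delimiters; inside a boost::regex pattern they are matched as literal slashes.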

    Finally, make sure Boost Regex is compiled with Unicode (ICU) support and use the Unicode-aware functions (boost::u32regex, boost::make_u32regex, boost::u32regex_replace).

    After spending some time with those documentation pages, I came up with:


    //#define BOOST_HAS_ICU
    #include <boost/property_tree/json_parser.hpp>
    #include <boost/regex.hpp>
    #include <boost/regex/icu.hpp>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Converts an ICU string back to UTF-8 (defined after main).
    std::string asUtf8(icu::UnicodeString const& ustr);
    
    std::string sample = R"(
    {
        "message":{
           "message_id":123,
           "from":{
              "id":12345,
              "first_name":"name",
              "username":"username"
           },
           "chat":{
              "id":12345,
              "first_name":"name",
              "username":"username",
              "type":"private"
           },
           "date":1478144459,
           "text":"this is \ud83d\udca9 a sentence"
        }
    }
    )";
    
    int main() {
    
        boost::property_tree::ptree pt;
        {
            std::istringstream iss(sample);
            read_json(iss, pt);
        }
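        // Decode the UTF-8 message text into an ICU (UTF-16) string.
        auto umessage       = icu::UnicodeString::fromUTF8(pt.get("message.text", ""));
        // \p{So} matches the Unicode "Symbol, other" category, which includes many emoji.
        boost::u32regex exp = boost::make_u32regex("\\p{So}");
    
        // Pass an empty replacement string to strip the symbols instead of marking them.
        auto clean = boost::u32regex_replace(umessage, exp, icu::UnicodeString::fromUTF8("<symbol>"));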
    
        std::cout << asUtf8(clean) << "\n";
    }
    
    std::string asUtf8(icu::UnicodeString const& ustr) {
        std::string r;
        {
            icu::StringByteSink<std::string> bs(&r);
            ustr.toUTF8(bs);
        }
    
        return r;
    }
    

    This prints:

    this is <symbol> a sentence
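
If rebuilding Boost Regex with ICU is not an option, one fallback is to walk the UTF-8 bytes yourself and drop code points in chosen ranges. The following is only a minimal sketch, not a complete emoji filter: `fromSurrogates`, `isEmojiLike` and `stripEmoji` are hypothetical helper names, the Unicode block ranges are an illustrative assumption, and the input is assumed to be well-formed UTF-8. It also shows that the sample's surrogate pair \ud83d\udca9 decodes to U+1F4A9, which lies outside the 1F600-1F6FF range used in the question, so that range alone would not have removed it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: combines a UTF-16 surrogate pair (as written in JSON
// \uXXXX escapes) into a single Unicode code point.
unsigned long fromSurrogates(unsigned long hi, unsigned long lo) {
    return 0x10000UL + ((hi - 0xD800UL) << 10) + (lo - 0xDC00UL);
}

// Hypothetical helper: decides which code points to drop.
// These ranges are an assumption for illustration, not a full emoji list.
bool isEmojiLike(unsigned long cp) {
    return (cp >= 0x1F300UL && cp <= 0x1F5FFUL)   // Miscellaneous Symbols and Pictographs
        || (cp >= 0x1F600UL && cp <= 0x1F64FUL)   // Emoticons
        || (cp >= 0x1F680UL && cp <= 0x1F6FFUL)   // Transport and Map Symbols
        || (cp >= 0x2600UL  && cp <= 0x27BFUL);   // Miscellaneous Symbols, Dingbats
}

// Copies utf8 to the result, skipping every code point for which
// isEmojiLike() is true. Assumes well-formed UTF-8 input.
std::string stripEmoji(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;          // sequence length in bytes
        unsigned long cp = lead;      // decoded code point

        // The lead byte tells us how many bytes the sequence has.
        if (lead >= 0xF0)      { len = 4; cp = lead & 0x07; }
        else if (lead >= 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { len = 2; cp = lead & 0x1F; }

        // Truncated trailing sequence: copy the remainder unchanged and stop.
        if (i + len > utf8.size()) {
            out.append(utf8, i, std::string::npos);
            break;
        }

        // Each continuation byte contributes its low six bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }

        if (!isEmojiLike(cp)) {
            out.append(utf8, i, len);
        }
        i += len;
    }
    return out;
}
```

Assuming the property tree hands back the message text as UTF-8, calling `stripEmoji` on it removes the symbol and leaves the two surrounding spaces next to each other. Unlike the `\p{So}` approach, this only catches the ranges you list, so the regex solution remains the more general one.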