c++ unicode replaceall surrogate-pairs

C++: How to remove surrogate Unicode values from a string?


How do you remove surrogate values from a std::string in C++? I am looking for a regular expression like this:

string pattern = u8"[\uD800-\uDFFF]";
regex regx(pattern);
name = regex_replace(name, regx, "_");

How do you do this in a multiplatform C++ project (e.g. CMake)?


Solution

  • First off, you can't store UTF-16 surrogates in a std::string, which is char-based. You would need a std::u16string (char16_t-based) or, on Windows only, a std::wstring (wchar_t-based, where wchar_t is 16 bits wide). JavaScript strings are UTF-16 strings.

    For those string types, you can use either:

    • std::remove_if() + std::basic_string::erase():

      #include <string>
      #include <algorithm>
      
      std::u16string str; // or std::wstring on Windows
      ...
      str.erase(
          std::remove_if(str.begin(), str.end(),
              [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
          ),
          str.end()
      );
      
    • std::erase_if() (C++20 and later only):

      #include <string>
      
      std::u16string str; // or std::wstring on Windows
      ...
      std::erase_if(str,
          [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
      );
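
    Either approach can be wrapped in a small helper that behaves the same on every platform, since char16_t is a 16-bit code unit everywhere. The following is a minimal sketch, assuming C++11 or later; the names is_surrogate and remove_surrogates are hypothetical helpers for illustration, not standard library functions:

    ```cpp
    #include <algorithm>
    #include <string>

    // Hypothetical helper: true if ch is a UTF-16 surrogate code unit
    // (high surrogates are 0xD800-0xDBFF, low surrogates are 0xDC00-0xDFFF).
    inline bool is_surrogate(char16_t ch) {
        return ch >= 0xD800 && ch <= 0xDFFF;
    }

    // Hypothetical helper: returns a copy of str with every surrogate code unit erased.
    std::u16string remove_surrogates(std::u16string str) {
        str.erase(std::remove_if(str.begin(), str.end(), is_surrogate), str.end());
        return str;
    }
    ```

    For example, remove_surrogates(u"a\U0001F600b") yields u"ab", because U+1F600 is stored as the surrogate pair 0xD83D 0xDE00 and both code units are erased. Note that this drops valid pairs (every character outside the Basic Multilingual Plane), not only unpaired surrogates.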
      

    UPDATE: You edited your question and changed its meaning. Originally you asked how to remove surrogates; now you are asking how to replace them. You can use std::replace_if() for that, e.g.:

    #include <string>
    #include <algorithm>
    
    std::u16string str; // or std::wstring on Windows
    ...
    std::replace_if(str.begin(), str.end(),
        [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); },
        u'_'
    );
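
    As a self-contained illustration of the replacement above, here is a minimal sketch, assuming C++11 or later. The function name replace_surrogates and its replacement parameter are hypothetical, chosen only for this example:

    ```cpp
    #include <algorithm>
    #include <string>

    // Hypothetical helper: returns a copy of str in which every surrogate
    // code unit (0xD800-0xDFFF) has been overwritten with `replacement`.
    std::u16string replace_surrogates(std::u16string str, char16_t replacement = u'_') {
        std::replace_if(str.begin(), str.end(),
            [](char16_t ch) { return ch >= 0xD800 && ch <= 0xDFFF; },
            replacement);
        return str;
    }
    ```

    Because the replacement works per code unit, a valid surrogate pair turns into two replacement characters: replace_surrogates(u"a\U0001F600b") yields u"a__b".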
    

    Or, if you really want a regex-based approach, you can use std::regex_replace(), e.g.:

    #include <string>
    #include <regex>
    
    std::wstring str; // std::basic_regex does not support char16_t strings!
    ...
    std::wstring newstr = std::regex_replace(
        str,
        std::wregex(L"[\\uD800-\\uDFFF]"),
        L"_"
    );