How do you remove surrogate values from a std::string in C++? I'm looking for a regular expression like this:
string pattern = u8"[\uD800-\uDFFF]";
regex regx(pattern);
name = regex_replace(name, regx, "_");
How do you do this in a multiplatform C++ project (e.g. one built with CMake)?
First off, you can't store UTF-16 surrogates in a std::string (char-based). You would need a std::u16string (char16_t-based), or a std::wstring (wchar_t-based) on Windows only, where wchar_t is 16 bits wide. JavaScript strings are UTF-16 strings.
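To illustrate what a surrogate looks like in such a string, here is a minimal sketch. The helper names is_surrogate and count_surrogates are hypothetical and not part of the standard library. A character outside the Basic Multilingual Plane, such as U+1F600, is stored in a std::u16string as two code units, both of which fall in the range 0xD800-0xDFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if the code unit is a UTF-16 surrogate
// (high surrogates are 0xD800-0xDBFF, low surrogates are 0xDC00-0xDFFF).
constexpr bool is_surrogate(char16_t ch) {
    return ch >= 0xD800 && ch <= 0xDFFF;
}

// Hypothetical helper: counts the surrogate code units in a UTF-16 string.
inline std::size_t count_surrogates(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t ch : s) {
        if (is_surrogate(ch)) {
            ++n;
        }
    }
    return n;
}
```

For example, the literal u"a\U0001F600b" has four code units ('a', 0xD83D, 0xDE00, 'b'), so count_surrogates() reports 2 for it. Removing or replacing every surrogate therefore affects every character outside the Basic Multilingual Plane.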
For those string types, you can use either:
std::remove_if() + std::basic_string::erase():
#include <string>
#include <algorithm>
std::u16string str; // or std::wstring on Windows
...
str.erase(
    std::remove_if(str.begin(), str.end(),
        [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
    ),
    str.end()
);
std::erase_if() (C++20 and later only):
#include <string>
std::u16string str; // or std::wstring on Windows
...
std::erase_if(str,
    [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); }
);
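If the same code has to build with both older and newer compilers, one possible way to combine the two options is to pick between them with the standard feature-test macro __cpp_lib_erase_if. This is only a sketch; the function name remove_surrogates is made up for illustration.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: removes every UTF-16 surrogate code unit in place.
inline void remove_surrogates(std::u16string& str) {
    auto is_surrogate = [](char16_t ch) {
        return (ch >= 0xd800) && (ch <= 0xdfff);
    };
#if defined(__cpp_lib_erase_if)
    // C++20 and later: <string> provides std::erase_if for std::basic_string.
    std::erase_if(str, is_surrogate);
#else
    // Earlier standards: fall back to the erase-remove idiom.
    str.erase(std::remove_if(str.begin(), str.end(), is_surrogate), str.end());
#endif
}
```

Both branches leave the string in the same state, so callers do not need to know which one was compiled.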
UPDATE: You edited your question to change its semantics. Originally, you asked how to remove surrogates; now you are asking how to replace them instead. You can use std::replace_if() for that task, e.g.:
#include <string>
#include <algorithm>
std::u16string str; // or std::wstring on Windows
...
std::replace_if(str.begin(), str.end(),
    [](char16_t ch){ return (ch >= 0xd800) && (ch <= 0xdfff); },
    u'_'
);
Or, if you really want a regex-based approach, you can use std::regex_replace(), e.g.:
#include <string>
#include <regex>
std::wstring str; // std::basic_regex does not support char16_t strings!
...
std::wstring newstr = std::regex_replace(
    str,
    std::wregex(L"[\\uD800-\\uDFFF]"),
    L"_"
);
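Since the regex version is tied to std::wstring, and wchar_t holds UTF-16 code units only on Windows, a multiplatform project may prefer the std::replace_if() approach. Below is a minimal sketch of how it could be wrapped so that the same code accepts either std::u16string or std::wstring. The function name replace_surrogates and its signature are assumptions made for illustration.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: returns a copy of str with every code unit in the
// surrogate range 0xD800-0xDFFF replaced by `replacement`.
template <typename CharT>
std::basic_string<CharT> replace_surrogates(std::basic_string<CharT> str,
                                            CharT replacement) {
    std::replace_if(str.begin(), str.end(),
        [](CharT ch) {
            // Widen to char32_t so the comparison behaves the same
            // regardless of the width or signedness of CharT.
            const auto u = static_cast<char32_t>(ch);
            return (u >= 0xd800) && (u <= 0xdfff);
        },
        replacement);
    return str;
}
```

For instance, replace_surrogates(std::u16string(u"a\U0001F600b"), u'_') yields u"a__b", because the character U+1F600 occupies two code units and each one is replaced separately.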