I am trying to strip accents from a string using the Boost Locale library.
The normalize function removes the entire accented character; I only want to remove the accent.
è -> e for example
Here is my code:
std::string hello(u8"élève");
boost::locale::generator gen;
std::string str = boost::locale::normalize(hello, boost::locale::norm_nfd, gen(""));
Desired output: eleve
My output: lve
Help please
That's not what normalize does. With NFD it performs "canonical decomposition". You need to THEN remove the combining-character code points.
UPDATE Adding a loose implementation, gleaned from the UTF-8 tables, based on the observation that most combining characters appear to lead with the byte 0xcc or 0xcd:
// also liable to strip some Greek characters that lead with 0xcd
template <typename Str>
static Str try_strip_diacritics(
    Str const& input,
    std::locale const& loc = std::locale())
{
    using Ch = typename Str::value_type;
    using T  = boost::locale::utf::utf_traits<Ch>;

    auto tmp = boost::locale::normalize(
        input, boost::locale::norm_nfd, loc);

    auto f = tmp.begin(), l = tmp.end(), out = f;
    while (f != l) {
        switch (*f) {
        case '\xcc':
        case '\xcd': // TODO find more
            T::decode(f, l); // decode advances f past the code point; drop it
            break;
        default:
            out = T::encode(T::decode(f, l), out); // copy the code point
            break;
        }
    }
    tmp.erase(out, l);
    return tmp;
}
Prints (on my box!):
Before: "élève" 0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
all-in-one: "eleve" 0x65 0x6c 0x65 0x76 0x65
Older answer text/analysis:
#include <boost/locale.hpp>
#include <iomanip>
#include <iostream>

static void dump(std::string const& s) {
    std::cout << std::hex << std::showbase << std::setfill('0');
    for (uint8_t ch : s)
        std::cout << " " << std::setw(4) << int(ch);
    std::cout << std::endl;
}

int main() {
    boost::locale::generator gen;
    std::string const pupil(u8"élève");
    std::string const str = boost::locale::normalize(
        pupil, boost::locale::norm_nfd, gen(""));

    std::cout << "Before: "; dump(pupil);
    std::cout << "After:  "; dump(str);
}
Prints, on my box:
Before: 0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
After: 0x65 0xcc 0x81 0x6c 0x65 0xcc 0x80 0x76 0x65
However, on Coliru it makes no difference, which indicates that the result depends on the available/system locales.
The docs say: https://www.boost.org/doc/libs/1_72_0/libs/locale/doc/html/conversions.html#conversions_normalization
Unicode normalization is the process of converting strings to a standard form, suitable for text processing and comparison. For example, character "ü" can be represented by a single code point or a combination of the character "u" and the diaeresis "¨". Normalization is an important part of Unicode text processing.
Unicode defines four normalization forms. Each specific form is selected by a flag passed to normalize function:
- NFD - Canonical decomposition - boost::locale::norm_nfd
- NFC - Canonical decomposition followed by canonical composition - boost::locale::norm_nfc or boost::locale::norm_default
- NFKD - Compatibility decomposition - boost::locale::norm_nfkd
- NFKC - Compatibility decomposition followed by canonical composition - boost::locale::norm_nfkc
For more details on normalization forms, read [this article][1].
It seems that you MIGHT get some of the way by doing the decomposition only (so NFD) and then removing any code points that aren't alpha.
This is cheating, because it assumes all code points are single-unit, which is not generically true, but for the sample it does work:
See the improved version above, which iterates over code points instead of bytes.