I wrote a library to create a crossword grid, and it works fine (at least as designed) for English words.
However, when I use Portuguese words, for example s1 = 'milhão' and s2 = 'sã', the function that tries to find an intersection between s1 and s2 fails if I use 'std::string'. I understand why: in UTF-8 'ã' is encoded in 2 bytes, so the comparison between 's1[4]' and 's2[1]' fails.
If I use 'std::u16string' or 'std::wstring', the function works.
How can I safely compare strings letter by letter, without knowing whether a letter is encoded in a single byte or in multiple bytes? Should I always use 'std::u32string' if I want my programs to be ready for worldwide use?
The truth is that I have never had to worry about localization in my programs, so I am somewhat confused.
Here is a program to illustrate my problem:
#include <cstdint>
#include <iostream>
#include <string>

// char16_t elements: 'ã' (U+00E3) occupies a single element, so the
// element-wise comparison finds the intersection.
void using_u16() {
    std::u16string _str1(u"milhão");
    std::u16string _str2(u"sã");
    auto _size1{_str1.size()};
    auto _size2{_str2.size()};
    for (decltype(_size2) _i2 = 0; (_i2 < _size2); ++_i2) {
        for (decltype(_size1) _i1 = 0; (_i1 < _size1); ++_i1) {
            if (_str1[_i1] == _str2[_i2]) {
                std::wcout << L"1 - 'milhão' met 'sã' in " << _i1 << ',' << _i2
                           << std::endl;
            }
        }
    }
}

// wchar_t elements: 'ã' is again a single element (UTF-16 or UTF-32,
// depending on the platform).
void using_wstring() {
    std::wstring _str1(L"milhão");
    std::wstring _str2(L"sã");
    auto _size1{_str1.size()};
    auto _size2{_str2.size()};
    for (decltype(_size2) _i2 = 0; (_i2 < _size2); ++_i2) {
        for (decltype(_size1) _i1 = 0; (_i1 < _size1); ++_i1) {
            if (_str1[_i1] == _str2[_i2]) {
                std::wcout << L"2 - 'milhão' met 'sã' in " << _i1 << ',' << _i2
                           << std::endl;
            }
        }
    }
}

// char elements: in UTF-8 'ã' occupies two bytes, so indexing the string
// by letter position breaks down.
void using_string() {
    std::string _str1("milhão");
    std::string _str2("sã");
    auto _size1{_str1.size()};
    auto _size2{_str2.size()};
    for (decltype(_size2) _i2 = 0; (_i2 < _size2); ++_i2) {
        for (decltype(_size1) _i1 = 0; (_i1 < _size1); ++_i1) {
            if (_str1[_i1] == _str2[_i2]) {
                std::cout << "3 - 'milhão' met 'sã' in " << _i1 << ',' << _i2
                          << std::endl;
            }
        }
    }
}

int main() {
    using_u16();
    using_wstring();
    using_string();
    return 0;
}
As I explained, nothing is printed when 'using_string()' is called.
Depending on how you define a character, the requirements for string comparison change.
You could define a character as a single code point. Many special characters can be represented as a single code point, and in that case std::u32string and char32_t are a good fit for your problem. Rust does the same with its chars() iterator, where every char is a 4-byte code point (Rust Docs). With the UTF-32 literals added in C++11 and a simple conversion between UTF-8 and UTF-32 you have all the necessary tools!
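For illustration, here is a minimal sketch of the intersection search based on code points. The UTF-8 to UTF-32 conversion is hand-rolled, assumes well-formed input and a UTF-8 execution charset, and is only meant to show the idea; a library would be more robust:

#include <iostream>
#include <string>

// Minimal UTF-8 -> UTF-32 decoder. Assumes well-formed UTF-8; a real
// program should validate the input or use a library instead.
std::u32string utf8_to_u32(const std::string &s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (c < 0x80)      { cp = c;        len = 1; } // ASCII byte
        else if (c < 0xE0) { cp = c & 0x1F; len = 2; } // 2-byte sequence
        else if (c < 0xF0) { cp = c & 0x0F; len = 3; } // 3-byte sequence
        else               { cp = c & 0x07; len = 4; } // 4-byte sequence
        for (std::size_t j = 1; j < len; ++j)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

int main() {
    // One element per code point, so indexing behaves as expected.
    std::u32string s1 = utf8_to_u32("milhão");
    std::u32string s2 = utf8_to_u32("sã");
    for (std::size_t i2 = 0; i2 < s2.size(); ++i2)
        for (std::size_t i1 = 0; i1 < s1.size(); ++i1)
            if (s1[i1] == s2[i2])
                std::cout << "'milhão' met 'sã' in " << i1 << ',' << i2 << '\n';
}

This prints the intersection at 4,1, i.e. the positions of 'ã' counted in code points rather than bytes.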
But sometimes a character needs multiple code points to be represented. Some characters even have ambiguous definitions, with multiple sequences encoding the same character. In that case you need more logic behind the comparison: grapheme clusters group logically connected code points. For example, an e followed by a combining acute accent is rendered as a single é. For characters that have either only a single-code-point or only a multi-code-point representation, comparing graphemes solves your problem. For the ambiguous characters with both single and multi code point representations, you need a simplification that converts a multi-code-point sequence into a single code point whenever a suitable representation exists. This procedure is called Unicode normalization, and it gives you a stable form of your characters to compare.
Here is a demonstration of the concept in Rust with the unicode_normalization crate:
use unicode_normalization::UnicodeNormalization;

fn main() {
    let single_cp = "\u{E9}"; // é as a single code point
    let multi_cp = "\u{65}\u{301}"; // e followed by a combining acute accent

    println!("== RAW ==");
    println!("Printed : {} {}", single_cp, multi_cp);
    println!("Bytes : {} {}", single_cp.bytes().len(), multi_cp.bytes().len());
    println!("Code Points : {} {}", single_cp.chars().count(), multi_cp.chars().count());

    // NFC composes the two-code-point sequence into a single code point.
    let single_cp_norm = single_cp.nfc().to_string();
    let multi_cp_norm = multi_cp.nfc().to_string();

    println!("== NORMALIZED ==");
    println!("Printed : {} {}", single_cp_norm, multi_cp_norm);
    println!("Bytes : {} {}", single_cp_norm.bytes().len(), multi_cp_norm.bytes().len());
    println!("Code Points : {} {}", single_cp_norm.chars().count(), multi_cp_norm.chars().count());
}
== RAW ==
Printed : é é
Bytes : 2 3
Code Points : 1 2
== NORMALIZED ==
Printed : é é
Bytes : 2 2
Code Points : 1 1
The code analyzes the single code point (left) and multi code point (right) representations of an optically identical character. In the RAW part you can clearly see that the byte and code point counts differ even though both strings are printed the same way. So a byte-by-byte comparison with std::string and a code point comparison with std::u32string are both ineffective. In the NORMALIZED part the multi code point representation was converted to a single code point, so both are equivalent, as indicated by the identical byte and code point counts. After normalization, the std::u32string approach works correctly in all cases where simplification to a single code point is possible.
To also accommodate characters that strictly require more than one code point, you can normalize first and then compare on the basis of grapheme clusters. This way the ambiguous representations collapse into a unified form, and the remaining multi code point sequences can still be compared. That is probably overengineered for your specific use case! Unicode normalization followed by an equality check on the resulting code points should be sufficient.
I don't have any hands-on experience with how an implementation in C++ would look, but in this Stack Overflow thread regarding Unicode normalization the lightweight libraries utfcpp for C++ and utf8proc for C were recommended. There is also a massive library called ICU, which provides various Unicode operations, including logical character iteration with the BreakIterator and normalization with the Normalizer.
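As a rough, untested sketch of that recommendation, the following normalizes both words to NFC with utf8proc and then compares code points. The calls to utf8proc_NFC and utf8proc_iterate follow the utf8proc documentation, but treat the details as assumptions:

#include <utf8proc.h>

#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// NFC-normalize a UTF-8 string and decode it into code points.
std::vector<utf8proc_int32_t> normalized_codepoints(const std::string &utf8) {
    // utf8proc_NFC returns a newly malloc'ed, NUL-terminated NFC string.
    utf8proc_uint8_t *nfc = utf8proc_NFC(
        reinterpret_cast<const utf8proc_uint8_t *>(utf8.c_str()));
    std::vector<utf8proc_int32_t> cps;
    const utf8proc_uint8_t *p = nfc;
    while (*p != 0) {
        utf8proc_int32_t cp;
        // A negative length means the input is treated as NUL-terminated.
        utf8proc_ssize_t n = utf8proc_iterate(p, -1, &cp);
        if (n <= 0) break; // invalid byte sequence
        cps.push_back(cp);
        p += n;
    }
    std::free(nfc);
    return cps;
}

int main() {
    auto s1 = normalized_codepoints("milhão");
    auto s2 = normalized_codepoints("sã");
    for (std::size_t i2 = 0; i2 < s2.size(); ++i2)
        for (std::size_t i1 = 0; i1 < s1.size(); ++i1)
            if (s1[i1] == s2[i2])
                std::cout << "'milhão' met 'sã' in " << i1 << ',' << i2 << '\n';
}

With this in place it no longer matters whether the input contains 'ã' as one precomposed code point or as 'a' followed by a combining tilde.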
After this post you may realize that Unicode and localization are quite complex topics. They are far from solved by merely throwing std::u32string into the mix. In the end you have to make assumptions about the source and stability of your characters and decide how capable your crossword library should be in handling these cases.
Thanks to @user17732522's feedback I improved the answer!