c++ string utf-8 unsigned

C++ Using u8 strings as unsigned strings


C++20 introduced char8_t as the type for UTF-8 string and character literals.

char8_t is a distinct type, but it has the same size, signedness, and alignment as unsigned char. Its values are never negative, so arithmetic, logical, and bitwise operations on it behave as they would on an unsigned char rather than on a possibly signed char.

Consider a base64 conversion algorithm: most implementations rely heavily on bitwise operations, where signed operations would usually have incorrect semantics.

One could accept a signed character string, or just a char string with its implementation-defined signedness, and reinterpret the string before operating on it, or one could accept an unsigned string.

If I chose to accept unsigned strings and exposed a public API (e.g. a user-facing library function) with the following signature:

std::u8string base64::encode(std::u8string_view);

would this be wrong? As in, would it imply that the function is intended to operate on UTF-8 encoded strings, instead of 8-bit ASCII or binary buffers?

I presume the response would be "yes."

I could create aliases to std::basic_string<unsigned char>, std::basic_string_view<unsigned char>, etc., but there would be no easy way to make string literals from them, whereas one could easily have written u8"Hello, world!" and passed it into the function.

So it would be harder to use when using string literals.

Is there a better way to accept and use unsigned strings than this?


Solution

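  • #include <cstddef>
    #include <span>
    #include <string>
    #include <string_view>
    namespace base64 {
    std::u8string encode(std::span<std::byte const> binaryData);
    // The helpers view the string's storage as bytes and forward to encode.
    inline std::u8string encode_string(std::u8string_view u8sv) {
      return encode(std::as_bytes(std::span(u8sv)));
    }
    inline std::u8string encode_string(std::string_view sv) {
      return encode(std::as_bytes(std::span(sv)));
    }
    }  // namespace base64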
    

    Take a span of const bytes, and provide helper functions that take u8 and char strings.

    Base64 is for encoding binary data. You can encode strings with it, and helper functions make that easier.

    I gave the helper functions a different name to make it clear that they encode a string's bytes into another string. The output of base64::encode can be fed back into base64::encode, but doing so by accident is an easy bug to introduce.

    Return a u8string, as the result is indeed a sequence of UTF-8 encoded characters.