Reading an array of bytes into UTF-16 characters on a machine with a specific UTF-16 character size


I have a question about how char16_t characters interact with SHA-256 hashes generated with OpenSSL.

I'm currently writing code that handles password hashing. I've generated a 256-bit hash, and I want to store it in the database in a UTF-16 encoded character field. In my C++ code, I use char16_t to hold such data. However, there is a problem: char16_t can be wider than 16 bits, depending on the machine the code ends up on. If I use memcpy() to copy the bytes of my SHA-256 hash into it, the result may be a mess on some machines.

What should I do in this situation? Read bytes differently, store hashes in the database differently, maybe something else?


Solution

  • SHA-256 produces 256 essentially random bits (32 bytes) of data, so the raw output will not always be valid UTF-16 (it can, for example, contain unpaired surrogate code units).

    You need to encode the 32 bytes into more than 32 bytes of UTF-16 to store them in your database, or you can change the database field to a proper 256-bit binary type (in MySQL, for example, BINARY(32)).

    One of the easier ways to store it in your DB as a string is to map each byte 1-to-1 to a character, so that the 32 bytes of data end up interleaved with 32 zero bytes:

    // Writing: widen each hash byte to one char16_t code unit (0x00-0xFF).
    unsigned char sha256_hash[256/8];
    get_hash(sha256_hash);
    char16_t db_data[256/8];
    for (std::size_t i = 0; i < std::size(db_data); ++i) {
        db_data[i] = static_cast<char16_t>(sha256_hash[i]);
    }
    write_to_db(db_data);


    // Reading: narrow each code unit back to a byte.
    char16_t db_data[256/8];
    read_from_db(db_data);
    unsigned char sha256_hash[256/8];
    for (std::size_t i = 0; i < std::size(sha256_hash); ++i) {
        // Anything above 0xFF was not produced by the encoding loop above.
        assert(db_data[i] <= 0xFF);
        sha256_hash[i] = static_cast<unsigned char>(db_data[i]);
    }
    

    Be careful if you are using null-terminated strings, though. You will need an extra character for the null terminator, and you must map the 0 byte to something else (0x100 would be a good choice).
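
The null-safe variant could look roughly like the sketch below. The names kHashSize, kZeroStandIn, encode_hash and decode_hash are made up for illustration, and std::u16string is assumed as the container. Going through char16_t values instead of memcpy() also means the result does not depend on how wide char16_t is on a given machine.

```cpp
#include <array>
#include <cstddef>
#include <string>

constexpr std::size_t kHashSize = 256 / 8;  // SHA-256 digest size in bytes
constexpr char16_t kZeroStandIn = 0x100;    // hypothetical stand-in for byte 0

// Encode: one code unit per byte, with byte 0 remapped so the result
// never contains a code unit that looks like a null terminator.
std::u16string encode_hash(const std::array<unsigned char, kHashSize>& hash) {
    std::u16string out;
    out.reserve(hash.size());
    for (unsigned char b : hash) {
        out.push_back(b == 0 ? kZeroStandIn : static_cast<char16_t>(b));
    }
    return out;
}

// Decode: returns false if the input could not have come from encode_hash.
bool decode_hash(const std::u16string& in,
                 std::array<unsigned char, kHashSize>& hash) {
    if (in.size() != hash.size()) return false;
    for (std::size_t i = 0; i < hash.size(); ++i) {
        const char16_t c = in[i];
        if (c == kZeroStandIn) {
            hash[i] = 0;
        } else if (c >= 0x01 && c <= 0xFF) {
            hash[i] = static_cast<unsigned char>(c);
        } else {
            return false;  // out of range for this encoding
        }
    }
    return true;
}
```

Every value this produces lies in the range 0x0001-0x0100, so it is never a surrogate and the string is always well-formed UTF-16. Calling c_str() on the result gives the null-terminated form with the terminator appended.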

    But if you have additional requirements (such as the stored value consisting of readable characters), consider Base64 or hexadecimal encoding.
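
For the hexadecimal option, a minimal sketch might look like this. The helper names to_hex, from_hex and hex_value are hypothetical, and std::u16string is again assumed as the type handed to the database layer.

```cpp
#include <array>
#include <cstddef>
#include <string>

constexpr std::size_t kHashSize = 256 / 8;  // SHA-256 digest size in bytes

// Encode each byte as two lowercase hex digits (64 characters in total).
std::u16string to_hex(const std::array<unsigned char, kHashSize>& hash) {
    static constexpr char16_t kDigits[] = u"0123456789abcdef";
    std::u16string out;
    out.reserve(hash.size() * 2);
    for (unsigned char b : hash) {
        out.push_back(kDigits[(b >> 4) & 0x0F]);  // high nibble
        out.push_back(kDigits[b & 0x0F]);         // low nibble
    }
    return out;
}

// Value of a single hex digit, or -1 if the character is not one.
int hex_value(char16_t c) {
    if (c >= u'0' && c <= u'9') return c - u'0';
    if (c >= u'a' && c <= u'f') return c - u'a' + 10;
    if (c >= u'A' && c <= u'F') return c - u'A' + 10;
    return -1;
}

// Decode 64 hex characters back into 32 bytes; returns false on bad input.
bool from_hex(const std::u16string& in,
              std::array<unsigned char, kHashSize>& hash) {
    if (in.size() != hash.size() * 2) return false;
    for (std::size_t i = 0; i < hash.size(); ++i) {
        const int hi = hex_value(in[2 * i]);
        const int lo = hex_value(in[2 * i + 1]);
        if (hi < 0 || lo < 0) return false;
        hash[i] = static_cast<unsigned char>((hi << 4) | lo);
    }
    return true;
}
```

The trade-off is size: the column has to hold 64 characters instead of 32. In return, the stored value uses only ASCII digits and letters, contains no zero code units, and can be read directly in a database client.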