Tags: rust, character-encoding

How to convert an OsString into a UTF-8 encoded string in a platform-independent way?


I need to print - and print only - an OsString to the user, using a system that only accepts UTF-8 strings as input.

How can I convert an OsString to a UTF-8 string (one way only) in a platform-independent way?

I already found the encoding crate, but I cannot find a way to detect the underlying OsString encoding.


Solution

  • After reading through tons of standard library documentation and source code, I've come to these conclusions.

    • On Unix, OsString is a Vec<u8> and can be any arbitrary bytes, expected to be UTF-8. On Windows OsString is currently in WTF-8 (source), which is a superset of UTF-8 that allows you to represent ill-formed UTF-16 in specifically-ill-formed UTF-8. It is sufficient to understand this as also being arbitrary bytes resembling UTF-8, as the properties of the WTF-8 encoding are only used if you recover the original bytes (see encode_wide below). If the same Unicode information in the OS is encoded in valid UTF-8 on Unix and valid UTF-16 on Windows, OsString will encode both as identical UTF-8. This is all internal and private, but helps to understand the public interface.
    • The platform-specific behavior visible to users arises only in conversion.
    • On reception from the OS, Windows systems use from_wide, which converts possibly-ill-formed UTF-16 into the encoding of OsString (WTF-8). Unix systems use from_vec, which just wraps the Vec<u8>. Both are lossless, but from_wide does change the layout. I believe the only way to create an &OsStr on Windows without first creating an OsString is from str, while on Unix you can also use from_bytes to create an &OsStr from a simple &[u8] slice.
    • Windows lets you convert OsStr into an iterator of possibly-ill-formed UTF-16 code units with encode_wide, which losslessly undoes the original conversion. On Unix, you can simply get the bytes back with as_bytes or into_vec. It seems like you can't get the bytes as &[u8] on Windows, which makes sense given the implementation. (A short sketch of these platform-specific accessors follows this list.)
    • Converting an OsString to a String normally requires a UTF-8 validity check. However, if the OsString was created on Windows and is valid Unicode, converting it to String skips that check, because the check was already performed when the data was originally converted to an OsString and the result was stored in the OsString as a boolean. This boolean is only meaningful if it's true, in which case the OsString is guaranteed to be valid UTF-8; if it's false, it may or may not be valid. Unix doesn't store this boolean. That means that if you keep an OsString around and it's valid, converting it to String many times will be cheaper on Windows than on Unix. However, if you receive many OsString values from the OS but never convert them to String, it will be cheaper on Unix than on Windows, due to Windows' up-front conversion.
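
    A minimal sketch of those platform-specific extension traits (the environment variable here is only a convenient, illustrative source of an OsString; any other source behaves the same):

        use std::ffi::OsString;

        fn main() {
            // Any OsString will do; an environment variable is just one source.
            let os: OsString = std::env::var_os("PATH").unwrap_or_default();

            #[cfg(unix)]
            {
                // On Unix the raw bytes are available directly via the extension traits.
                use std::os::unix::ffi::{OsStrExt, OsStringExt};
                let bytes: &[u8] = os.as_bytes();
                println!("{} raw bytes", bytes.len());
                // ...and an OsString can be rebuilt from arbitrary bytes.
                let rebuilt = OsString::from_vec(bytes.to_vec());
                assert_eq!(rebuilt, os);
            }

            #[cfg(windows)]
            {
                // On Windows, encode_wide losslessly recovers the (possibly ill-formed)
                // UTF-16 code units, and from_wide rebuilds an OsString from them.
                use std::os::windows::ffi::{OsStrExt, OsStringExt};
                let wide: Vec<u16> = os.encode_wide().collect();
                println!("{} UTF-16 code units", wide.len());
                let rebuilt = OsString::from_wide(&wide);
                assert_eq!(rebuilt, os);
            }
        }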

    So essentially, there are three options (sketched in code after this list):

    1. The OS gave you a valid string. The OsString is already valid UTF-8, and you can get those bytes with into_string or to_str.
    2. The OS gave you a possibly invalid string. You can check this with the previous methods, or you can call to_string_lossy to do automatic replacement.
    3. If you want to actually know what the OS gave you, perhaps if you want to use a different replacement strategy or if you want to reinterpret the bytes, you can use encode_wide on Windows, or as_bytes or into_vec on Unix.
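
    A short sketch putting the first two options together (command-line arguments are used here only as a convenient source of OsString values):

        use std::ffi::OsString;

        fn print_utf8(os: OsString) {
            // Option 1: the strict conversion succeeds only for valid Unicode,
            // handing the original OsString back on failure.
            match os.into_string() {
                Ok(s) => println!("{}", s),
                // Option 2: fall back to lossy conversion, which replaces any
                // ill-formed sequences with U+FFFD.
                Err(os) => println!("{}", os.to_string_lossy()),
            }
            // Option 3 (not shown): use encode_wide / as_bytes / into_vec when the
            // exact bytes the OS handed over are needed.
        }

        fn main() {
            for arg in std::env::args_os() {
                print_utf8(arg);
            }
        }

    If you only have an &OsStr and don't want to allocate, to_str returns Option<&str> and can be used for the same validity check.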