I need to print (and only print) an `OsString` to the user, using a system that only accepts UTF-8 strings as input. How can I convert an `OsString` to a UTF-8 string (one way only) in a platform-independent way?

I already found the `encoding` crate, but I cannot find a way to detect the base `OsString` encoding.
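For context, a minimal sketch of the situation (the value here is a hypothetical placeholder; in my real code it comes from the OS, e.g. arguments or file paths). `OsString` doesn't implement `Display`, so it can't be handed to a UTF-8-only sink directly:

```rust
use std::ffi::OsString;

fn main() {
    // Hypothetical stand-in for a value received from the OS.
    let value: OsString = OsString::from("example");

    // This is the problem: Display is not implemented for OsString,
    // so I can't pass it to the UTF-8-only system directly.
    // println!("{}", value); // does not compile

    // Debug output works, but it quotes and escapes rather than
    // producing a clean UTF-8 string.
    println!("{:?}", value); // prints "example" (with quotes)
}
```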
After reading through tons of standard library documentation and source code, I've come to these conclusions.
On Unix, an `OsString` is a `Vec<u8>` and can be any arbitrary bytes, expected to be UTF-8. On Windows, an `OsString` is currently in WTF-8 (source), which is a superset of UTF-8 that allows you to represent ill-formed UTF-16 in specifically-ill-formed UTF-8. It is sufficient to understand this as also being arbitrary bytes resembling UTF-8, as the properties of the WTF-8 encoding are only used if you recover the original bytes (see `encode_wide` below). If the same Unicode information in the OS is encoded as valid UTF-8 on Unix and valid UTF-16 on Windows, `OsString` will encode both as identical UTF-8. This is all internal and private, but it helps to understand the public interface.

To convert into an `OsString`, Windows uses `from_wide`, which converts possibly-ill-formed UTF-16 into the encoding of `OsString` (WTF-8). Unix systems use `from_vec`, which just wraps the `Vec<u8>`. Both are lossless, but `from_wide` does change the layout. I believe the only way to create an `&OsStr` on Windows without first creating an `OsString` is from a `str`, while on Unix you can also use `from_bytes` to create an `&OsStr` from a simple `&[u8]` slice.

To convert back out, on Windows you can turn an `OsStr` into an iterator of possibly-ill-formed UTF-16 code units with `encode_wide`, which losslessly undoes the original conversion. On Unix, you can simply get the bytes back with `as_bytes` or `into_vec`. It seems like you can't get the bytes as `&[u8]`
on Windows, which makes sense given the implementation.

`OsString` values have their validity checked when converting to `String`. However, if an `OsString` is on Windows and valid Unicode, then converting to `String` skips the validity check, as this was done when originally converting to `OsString` and stored in the `OsString` as a boolean. This boolean is only meaningful if it's true, in which case the `OsString` is guaranteed to be valid UTF-8; if it's false, it may or may not be valid. Unix doesn't have this boolean. That means that if you keep an `OsString` around and it's valid, converting it many times to `String` will be cheaper on Windows than on Unix. However, if you receive many `OsString` values from the OS but never convert them to `String`, it will be cheaper on Unix than on Windows, due to Windows' up-front conversion.

So essentially, there are three options:
1. Ensure the `OsString` is already valid UTF-8, and get those bytes with `into_string` or `to_str`.
2. Use `to_string_lossy` to do automatic replacement of any invalid sequences with U+FFFD (`�`).
3. Handle the underlying data yourself: `encode_wide` on Windows, or `as_bytes` or `into_vec` on Unix.
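The first two options can be sketched as follows (the input is a hypothetical placeholder; a real `OsString` would come from the OS):

```rust
use std::ffi::OsString;

fn main() {
    let os = OsString::from("héllo");

    // Option 1 (owned): fallible conversion; on failure the original
    // OsString is handed back in the Err variant.
    match os.clone().into_string() {
        Ok(s) => println!("valid UTF-8: {s}"),
        Err(original) => println!("not valid UTF-8: {original:?}"),
    }

    // Option 1 (borrowed): to_str returns None if not valid UTF-8.
    if let Some(s) = os.to_str() {
        println!("as &str: {s}");
    }

    // Option 2: lossy conversion; invalid sequences become U+FFFD.
    // Returns Cow<str>, borrowing when no replacement was needed.
    let lossy = os.to_string_lossy();
    println!("lossy: {lossy}");
}
```

For a one-way, print-only conversion like the question describes, option 2 is the simplest: it always succeeds and never allocates unless a replacement is actually needed.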