
Is it possible to decode bytes to UTF-8, converting errors to escape sequences in Rust?


In Rust it's possible to get UTF-8 from bytes by doing this:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

This either works or it doesn't, but Python has the ability to handle errors, e.g.:

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape')

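In this example the errors argument controls what happens to invalid UTF-8 sequences: instead of failing, ignoring them, or replacing them with a placeholder, the decoder keeps the information. With surrogateescape each undecodable byte is mapped to a lone surrogate code point (U+DC80 through U+DCFF) so it can be recovered later; the related backslashreplace handler substitutes a textual escape such as \xff, which is itself valid UTF-8. See the Python docs for details.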

Does Rust have a way to get a UTF-8 string from bytes which escapes errors instead of failing entirely?


Solution

  • Yes, via String::from_utf8_lossy, which replaces each invalid sequence with U+FFFD (the replacement character):

    fn main() {
        let text = [104, 101, 0xFF, 108, 111];
        let s = String::from_utf8_lossy(&text);
        println!("{}", s); // he�lo
    }
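
    One detail worth knowing: String::from_utf8_lossy returns a Cow<str>, not a String. When the input is already valid it simply borrows the bytes, and it only allocates when something had to be replaced. A minimal sketch of telling the two cases apart (decode_lossy is a made-up helper name, not part of the standard library):

```rust
use std::borrow::Cow;

/// Hypothetical helper: decode lossily and report whether any
/// invalid sequence had to be replaced.
fn decode_lossy(bytes: &[u8]) -> (String, bool) {
    match String::from_utf8_lossy(bytes) {
        // Input was already valid UTF-8: the Cow just borrows it.
        Cow::Borrowed(s) => (s.to_owned(), false),
        // Input contained invalid sequences: each became U+FFFD.
        Cow::Owned(s) => (s, true),
    }
}

fn main() {
    let (text, replaced) = decode_lossy(b"hello");
    println!("{} (replaced: {})", text, replaced); // hello (replaced: false)

    let (text, replaced) = decode_lossy(&[104, 101, 0xFF, 108, 111]);
    println!("{} (replaced: {})", text, replaced); // he�lo (replaced: true)
}
```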
    

    If you need more control over the process, you can use std::str::from_utf8, as suggested by the other answer. However, there's no reason to double-validate the bytes as it suggests.

    A quickly hacked-up example:

    use std::str;
    
    fn example(mut bytes: &[u8]) -> String {
        let mut output = String::new();
    
        loop {
            match str::from_utf8(bytes) {
                Ok(s) => {
                    // The entire rest of the string was valid UTF-8, we are done
                    output.push_str(s);
                    return output;
                }
                Err(e) => {
                    let (good, bad) = bytes.split_at(e.valid_up_to());
    
                    if !good.is_empty() {
                        let s = unsafe {
                            // This is safe because we have already validated this
                            // UTF-8 data via the call to `str::from_utf8`; there's
                            // no need to check it a second time
                            str::from_utf8_unchecked(good)
                        };
                        output.push_str(s);
                    }
    
                    if bad.is_empty() {
                        // No more data left
                        return output;
                    }
    
                    // Do whatever type of recovery you need to here
                    output.push_str("<badbyte>");
    
                    // Skip the bad byte and try again
                    bytes = &bad[1..];
                }
            }
        }
    }
    
    fn main() {
        let r = example(&[104, 101, 0xFF, 108, 111]);
        println!("{}", r); // he<badbyte>lo
    }
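
    The loop above skips exactly one byte per error. If you would rather treat a whole broken sequence as one unit, Utf8Error::error_len reports how many bytes the invalid sequence covers, or None when the input simply ends in the middle of a character. A rough sketch of that variant, under the assumption that a fixed placeholder string is all you need (decode_with_placeholder and its placeholder parameter are names invented for this example):

```rust
use std::str;

/// Hypothetical variant of `example`: replaces each invalid *sequence*
/// (rather than each invalid byte) with `placeholder`.
fn decode_with_placeholder(mut bytes: &[u8], placeholder: &str) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // Everything that was left is valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                // SAFETY: `str::from_utf8` just reported these bytes as valid.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });
                output.push_str(placeholder);

                match e.error_len() {
                    // An invalid sequence of `len` bytes: skip all of it
                    Some(len) => bytes = &bad[len..],
                    // The input ended in the middle of a character
                    None => return output,
                }
            }
        }
    }
}

fn main() {
    let r = decode_with_placeholder(&[104, 101, 0xFF, 108, 111], "<badbyte>");
    println!("{}", r); // he<badbyte>lo

    // 0xE2 0x82 starts a three-byte character that is never finished
    let r = decode_with_placeholder(&[b'a', 0xE2, 0x82, b'b'], "<badbyte>");
    println!("{}", r); // a<badbyte>b
}
```

    With the one-byte-at-a-time loop the second input would produce two placeholders, because the stray continuation byte 0x82 is reported as a separate error.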
    

    You could extend this to take values to replace bad bytes with, a closure to handle the bad bytes, etc. For example:

    fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
        // ...    
                    handler(&mut output, bad);
        // ...
    }
    
    let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
        use std::fmt::Write;
        write!(output, "\\U{{{}}}", bytes[0]).unwrap()
    });
    println!("{}", r); // he\U{255}lo
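
    For completeness, here is one way the elided pieces might be filled in, with a handler that writes Python-style \xNN escapes (roughly what errors='backslashreplace' produces). Treat it as a sketch rather than the canonical version: decode_with_handler and escape_hex are names picked for this example, and the handler is assumed to receive the remaining bytes starting at the bad one.

```rust
use std::fmt::Write;
use std::str;

/// Hypothetical complete version of the closure-based `example`.
/// `handler` gets the output buffer and the remaining input, starting
/// at the byte that failed to decode.
fn decode_with_handler(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                // SAFETY: `str::from_utf8` just reported these bytes as valid.
                output.push_str(unsafe { str::from_utf8_unchecked(good) });

                // Let the caller decide how to represent the bad byte
                handler(&mut output, bad);

                // An error always leaves at least one byte in `bad`;
                // skip it and try again
                bytes = &bad[1..];
            }
        }
    }
}

/// Example handler: writes the bad byte as a `\xNN` escape.
fn escape_hex(output: &mut String, rest: &[u8]) {
    write!(output, "\\x{:02x}", rest[0]).unwrap();
}

fn main() {
    let r = decode_with_handler(&[104, 101, 0xFF, 108, 111], escape_hex);
    println!("{}", r); // he\xfflo
}
```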
    

    See also: