
How do I print a UTF-16 string in Zig?


I've been trying to code a UTF-16 string structure, and although the standard library provides a unicode module, it doesn't seem to provide a way to print out a slice of u16. I've tried this:

const std = @import("std");
const unicode = std.unicode;
const stdout = std.io.getStdOut().outStream();

pub fn main() !void {
    const unicode_str = unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");
    try stdout.print("{}\n", .{unicode_str});
}

This outputs:

[12:0]u16@202e9c

Is there a way to print a unicode string ([]u16) without converting it back into a non-unicode string ([]u8)?


Solution

  • Both []const u8 and []const u16 store encoded Unicode codepoints. Codepoints range from U+0000 to U+10FFFF (1,114,112 possible values), so a string with exactly one array element per codepoint would have to be []const u21. UTF-8 and UTF-16 both use multiple code units to encode the codepoints that don't fit in a single u8 or u16. Unless you need UTF-16 for compatibility (for example, with some Windows functions), you should probably use UTF-8 encoded []const u8 strings.

    To print UTF-16 to a UTF-8 stream, you have to decode the UTF-16 and re-encode it as UTF-8. There is currently no formatting specifier to do this automatically.

    You can either convert the entire string at once, requiring allocation:

    const utf8string = try std.unicode.utf16leToUtf8Alloc(alloc, utf16le);
    defer alloc.free(utf8string); // the caller owns the returned slice
    

    Or, without allocation:

    const writer = std.io.getStdOut().writer();
    var it = std.unicode.Utf16LeIterator.init(utf16le);
    // A single codepoint needs at most 4 bytes in UTF-8.
    var buf: [4]u8 = undefined;
    while (try it.nextCodepoint()) |codepoint| {
        // Re-encode each decoded codepoint and write it out immediately.
        const len = try std.unicode.utf8Encode(codepoint, &buf);
        try writer.writeAll(buf[0..len]);
    }
    

    Note that every codepoint is written separately, so without a buffered writer this will be very slow whenever each write requires a syscall, as it does for stdout.
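The non-allocating loop can be combined with a buffered writer roughly as follows. This is only a sketch: writeUtf16Le is a hypothetical helper name, not part of the standard library, and the code assumes a Zig version in which std.io.getStdOut and std.io.bufferedWriter are available (these I/O APIs have changed between releases, so adjust for your compiler).

```zig
const std = @import("std");

/// Hypothetical helper (not in std): decodes UTF-16LE code units and
/// writes the UTF-8 encoding of each codepoint to any writer.
fn writeUtf16Le(writer: anytype, utf16le: []const u16) !void {
    var it = std.unicode.Utf16LeIterator.init(utf16le);
    // A single codepoint needs at most 4 bytes in UTF-8.
    var buf: [4]u8 = undefined;
    while (try it.nextCodepoint()) |codepoint| {
        const len = try std.unicode.utf8Encode(codepoint, &buf);
        try writer.writeAll(buf[0..len]);
    }
}

pub fn main() !void {
    const utf16le = std.unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");

    // Assumption: std.io.bufferedWriter exists in the Zig version in use.
    // It collects the small per-codepoint writes in memory.
    var bw = std.io.bufferedWriter(std.io.getStdOut().writer());
    try writeUtf16Le(bw.writer(), utf16le);
    try bw.writer().writeByte('\n');

    // Nothing is guaranteed to reach stdout until the buffer is flushed.
    try bw.flush();
}
```

Because the helper takes the writer as anytype, the same function works unchanged with an unbuffered writer or an in-memory buffer; only the call site decides whether output is buffered.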