I've been trying to code a UTF-16 string structure, and although the standard library provides a unicode module, it doesn't seem to provide a way to print out a slice of u16.
I've tried this:
const std = @import("std");
const unicode = std.unicode;
const stdout = std.io.getStdOut().outStream();
pub fn main() !void {
    const unicode_str = unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");
    try stdout.print("{}\n", .{unicode_str});
}
This outputs:
[12:0]u16@202e9c
Is there a way to print a Unicode string ([]u16) without converting it back into a non-Unicode string ([]u8)?
Both []const u8 and []const u16 store encoded Unicode codepoints. Unicode codepoints range from 0 to 0x10FFFF (1,114,111), which needs 21 bits, so an actual Unicode string with one array index per codepoint would have to be []const u21. UTF-8 and UTF-16 both need more than one code unit for codepoints that don't fit in a single u8 or u16. Unless there is a compatibility reason for UTF-16 (like some Windows functions), you should probably be using []const u8 Unicode strings.
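To make the difference between codepoints and code units concrete, here is a small sketch that counts all three for the string from the question. It assumes a reasonably recent Zig in which std.unicode.utf8CountCodepoints and std.debug.print are available; the constant names text_utf8 and text_utf16 are made up for this example.

```zig
const std = @import("std");

// The sample text from the question, as UTF-8 bytes and as UTF-16LE code units.
// (text_utf8 and text_utf16 are illustrative names, not std identifiers.)
const text_utf8 = "😎 hello! 😎";
const text_utf16 = std.unicode.utf8ToUtf16LeStringLiteral(text_utf8);

pub fn main() !void {
    // Number of Unicode codepoints, independent of the encoding.
    const codepoints = try std.unicode.utf8CountCodepoints(text_utf8);
    std.debug.print("codepoints:        {d}\n", .{codepoints});
    // Each emoji takes 4 bytes in UTF-8.
    std.debug.print("UTF-8 bytes:       {d}\n", .{text_utf8.len});
    // Each emoji takes 2 code units (a surrogate pair) in UTF-16.
    std.debug.print("UTF-16 code units: {d}\n", .{text_utf16.len});
}
```

This should report 10 codepoints, 16 UTF-8 bytes and 12 UTF-16 code units (the 12 matches the [12:0]u16 type in the question's output), so neither encoding gives you one array index per codepoint.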
To print UTF-16 to a UTF-8 stream, you have to decode the UTF-16 and re-encode it as UTF-8. There is currently no format specifier that does this automatically.
You can either convert the entire string at once, requiring allocation:
const utf8string = try std.unicode.utf16leToUtf8Alloc(alloc, utf16le);
defer alloc.free(utf8string); // the caller owns the returned buffer
Or, without allocation:
const writer = std.io.getStdOut().writer();
var it = std.unicode.Utf16LeIterator.init(utf16le);
// Decode one codepoint at a time; the iterator handles surrogate pairs.
while (try it.nextCodepoint()) |codepoint| {
    var buf: [4]u8 = undefined; // a UTF-8 sequence is at most 4 bytes
    const len = try std.unicode.utf8Encode(codepoint, &buf);
    try writer.writeAll(buf[0..len]);
}
Note that this issues one write per codepoint, so it will be very slow without a buffered writer if you are writing somewhere that requires a syscall for each write.
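If you want to stay allocation-free but avoid a write per codepoint, one option is to re-encode into a fixed stack buffer first and write the result once. The following is only a sketch: utf16LeToUtf8Buf is a hypothetical helper written for this answer (not a std function), the 64-byte buffer size is an arbitrary assumption, and std.debug.print (which writes to stderr) stands in for a single writeAll on your real output.

```zig
const std = @import("std");

/// Hypothetical helper (not part of std): decodes UTF-16LE and re-encodes it
/// as UTF-8 into `out`, returning the written prefix of `out`.
/// Returns error.NoSpaceLeft if `out` is too small.
fn utf16LeToUtf8Buf(out: []u8, utf16le: []const u16) ![]u8 {
    var it = std.unicode.Utf16LeIterator.init(utf16le);
    var pos: usize = 0;
    while (try it.nextCodepoint()) |codepoint| {
        // utf8Encode asserts that the output is large enough, so check first.
        const need = try std.unicode.utf8CodepointSequenceLength(codepoint);
        if (out.len - pos < need) return error.NoSpaceLeft;
        pos += try std.unicode.utf8Encode(codepoint, out[pos..]);
    }
    return out[0..pos];
}

pub fn main() !void {
    const utf16le = std.unicode.utf8ToUtf16LeStringLiteral("😎 hello! 😎");
    var buf: [64]u8 = undefined; // assumed to be large enough for this string
    const utf8 = try utf16LeToUtf8Buf(&buf, utf16le);
    // One call for the whole string instead of one per codepoint.
    std.debug.print("{s}\n", .{utf8});
}
```

For strings of unknown length, a buffered writer or the allocating conversion is the better fit, since a fixed buffer can only fail once it is full.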