java, utf-16, project-panama

Equivalent of MemorySegment.getUtf8String for UTF-16


I'm porting my JNA-based library to "pure" Java using the Foreign Function and Memory API (JEP 424) in JDK 19.

One frequent use case my library handles is reading null-terminated strings from native memory. For most *nix applications these are "C strings", and the MemorySegment.getUtf8String() method is sufficient for the task.

Native Windows strings, however, are stored in UTF-16 (LE). Referenced as arrays of TCHAR or as "wide strings", they are treated similarly to "C strings", except that each code unit occupies 2 bytes and the terminator is a 2-byte null.

JNA provides a Native.getWideString() method for this purpose which invokes native code to efficiently iterate over the appropriate character set.

I don't see a UTF-16 equivalent of getUtf8String() (or of the corresponding set...() method) optimized for these Windows-based applications.

I can work around the problem with a few approaches:

  • If I'm reading from a fixed size buffer, I can create a new String(bytes, StandardCharsets.UTF_16LE) and:
    • If I know the memory was cleared before being filled, use trim()
    • Otherwise split() on the null delimiter and extract the first element
  • If I'm just reading from a pointer offset with no knowledge of the total size (or a very large total size I don't want to instantiate into a byte[]) I can iterate character-by-character looking for the null.
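
The two workarounds above can be sketched with plain java.nio types. This is only an illustration under stated assumptions: a `ByteBuffer` stands in for native memory, the class and method names (`WideStringWorkarounds`, `fromFixedBuffer`, `scanFrom`) are hypothetical, and `indexOf('\0')` is swapped in for `trim()`/`split()` because it covers both the cleared and the uncleared buffer case.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class WideStringWorkarounds {

    /** Fixed-size buffer: decode everything, then cut at the first NUL char, if any. */
    static String fromFixedBuffer(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_16LE);
        int nul = decoded.indexOf('\0');
        return nul < 0 ? decoded : decoded.substring(0, nul);
    }

    /** Unknown length: read 2-byte code units from a byte offset until a NUL unit or the end. */
    static String scanFrom(ByteBuffer memory, int offset) {
        // duplicate() leaves the caller's position, limit and byte order untouched
        ByteBuffer le = memory.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        StringBuilder sb = new StringBuilder();
        for (int pos = offset; pos + 1 < le.limit(); pos += 2) {
            char c = le.getChar(pos); // absolute read of one UTF-16LE code unit
            if (c == '\0') {
                break;
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = "Hi\0junk".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(fromFixedBuffer(raw));               // prints "Hi"
        System.out.println(scanFrom(ByteBuffer.wrap(raw), 0));  // prints "Hi"
    }
}
```

The scanning variant never materializes the whole region as a `byte[]`, at the cost of one bounds-checked read per code unit.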

While certainly I wouldn't expect the JDK to provide native implementations for every character set, I would think that Windows represents a significant enough usage share to support its primary native encoding alongside the UTF-8 convenience methods. Is there a method to do this that I haven't discovered yet? Or are there any better alternatives than the new String() or character-based iteration approaches I've described?


Solution

  • A charset decoder gives you one way to convert a null-terminated wide (UTF-16LE) MemorySegment to a String on Windows with the Foreign Function and Memory API. It may be no improvement on your workarounds, since it still scans the decoded character buffer for the null position.

    public static String toJavaString(MemorySegment wide) {
        return toJavaString(wide, StandardCharsets.UTF_16LE);
    }
    public static String toJavaString(MemorySegment segment, Charset charset) {
        // JDK Panama only handles UTF-8 here: it does a strlen()-style scan for a 0 byte,
        // which is valid because every byte of a multi-byte UTF-8 sequence has its high bit set.
        if (StandardCharsets.UTF_8 == charset)
            return segment.getUtf8String(0);
    
        // if (StandardCharsets.UTF_16LE == charset) {
        //     return Holger answer
        // }
    
        // This conversion is convoluted: MemorySegment->ByteBuffer->CharBuffer->String
        CharBuffer cb = charset.decode(segment.asByteBuffer());
    
        // cb.array() isn't valid unless cb.hasArray() is true so use cb.get() to
        // find a null terminator character, ignoring it and the remaining characters
        final int max = cb.limit();
        int len = 0;
        while(len < max && cb.get(len) != '\0')
            len++;
    
        return cb.limit(len).toString();
    }
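
The UTF-8 comment in the code above can be checked without any native memory. The sketch below (the class name `ZeroByteScan` and its helper are made up for illustration) looks for the first 0x00 byte in an encoded string: UTF-8 text contains none, while UTF-16LE produces one for every ASCII character, which is why a byte-wise strlen()-style scan cannot be reused for wide strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ZeroByteScan {

    /** Index of the first 0x00 byte in the encoded form of s, or -1 if there is none. */
    static int firstZeroByte(String s, Charset charset) {
        byte[] bytes = s.getBytes(charset);
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Multi-byte UTF-8 sequences never contain a zero byte
        System.out.println(firstZeroByte("h\u00e9llo \u20ac", StandardCharsets.UTF_8)); // prints -1
        // 'H' is encoded as 0x48 0x00 in UTF-16LE, so a byte-wise scan stops after one byte
        System.out.println(firstZeroByte("Hi", StandardCharsets.UTF_16LE));             // prints 1
    }
}
```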
    

    Going the other way, from a String to a null-terminated Windows wide MemorySegment:

    public static MemorySegment toCString(SegmentAllocator allocator, String s, Charset charset) {
        // "==" is OK here as StandardCharsets.UTF_8 == Charset.forName("UTF8")
        if (StandardCharsets.UTF_8 == charset)
            return allocator.allocateUtf8String(s);
    
        // else if (StandardCharsets.UTF_16LE == charset) {
        //     return Holger answer
        // }
    
        // For multi-byte charsets it is safer to append the terminator '\0' to the String and
        // let the encoder emit the right number of null bytes (typically 1, 2 or 4)
        return allocator.allocateArray(JAVA_BYTE, (s+"\0").getBytes(charset));
    }
    
    /** Convert Java String to Windows Wide String format */
    public static MemorySegment toWideString(String s, SegmentAllocator allocator) {
        return toCString(allocator, s, StandardCharsets.UTF_16LE);
    }
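
The "typically 1, 2 or 4 bytes" remark in toCString can be verified by encoding a lone `'\0'`. This is a small sketch with hypothetical names (`TerminatorWidth`, `terminatorBytes`, `encodedWithTerminator`). It looks up `UTF-32LE` via `Charset.forName`, on the assumption that the running JDK ships that charset, as standard JDK builds do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class TerminatorWidth {

    /** Number of bytes the charset's encoder emits for a single NUL character. */
    static int terminatorBytes(Charset charset) {
        return "\0".getBytes(charset).length;
    }

    /** Encoded bytes of s followed by the charset-specific terminator. */
    static byte[] encodedWithTerminator(String s, Charset charset) {
        return (s + "\0").getBytes(charset);
    }

    public static void main(String[] args) {
        System.out.println(terminatorBytes(StandardCharsets.UTF_8));       // prints 1
        System.out.println(terminatorBytes(StandardCharsets.UTF_16LE));    // prints 2
        System.out.println(terminatorBytes(Charset.forName("UTF-32LE")));  // prints 4
        // "Hi" as a wide string: 2 chars * 2 bytes + 2-byte terminator
        System.out.println(encodedWithTerminator("Hi", StandardCharsets.UTF_16LE).length); // prints 6
    }
}
```

Appending the terminator before encoding avoids hard-coding its width per charset, which is the design choice the comment in toCString describes.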
    

    Like you, I'd also like to know if there are better approaches than the above.

    JDK22 Update

    JDK 22 supports conversion for the charsets defined in StandardCharsets (including UTF_16LE), so converting a Java String to a null-terminated MemorySegment is simply:

    var seg = arena.allocateFrom(str, charset);
    

    A fallback for other character sets uses the approach with appending \0:

    var seg = arena.allocateFrom(JAVA_BYTE, (str+"\0").getBytes(charset));
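
For completeness, here is a minimal round-trip sketch for JDK 22 or later. The reading direction uses `MemorySegment.getString(long, Charset)`, which is part of the finalized JDK 22 API but is not mentioned in the answer above. The class and method names (`Jdk22WideStrings`, `roundTrip`, `encodedSize`) are hypothetical.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.charset.StandardCharsets;

final class Jdk22WideStrings {

    /** Writes s as a null-terminated UTF-16LE string and reads it back. */
    static String roundTrip(String s) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocateFrom(s, StandardCharsets.UTF_16LE);
            // getString scans for the charset's terminator (2 zero bytes for UTF-16LE)
            return seg.getString(0, StandardCharsets.UTF_16LE);
        }
    }

    /** Size in bytes of the allocated segment, including the terminator. */
    static long encodedSize(String s) {
        try (Arena arena = Arena.ofConfined()) {
            return arena.allocateFrom(s, StandardCharsets.UTF_16LE).byteSize();
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Hi"));   // prints "Hi"
        System.out.println(encodedSize("Hi")); // prints 6
    }
}
```

On JDK 19 to 21 these methods do not exist under these names, so the decoder-based helpers earlier in the answer remain the option there.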