
JavaScript equivalent of Java's String.getBytes(StandardCharsets.UTF_8)


I have the following Java code:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

String str = "\u00A0";
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.toString(bytes));

This outputs the following byte array:

[-62, -96]

I am trying to get the same result in JavaScript. I have tried the solution posted here:

https://stackoverflow.com/a/51904484/12177456

function strToUtf8Bytes(str) {
  const utf8 = [];
  for (let ii = 0; ii < str.length; ii++) {
    let charCode = str.charCodeAt(ii);
    if (charCode < 0x80) utf8.push(charCode);
    else if (charCode < 0x800) {
      utf8.push(0xc0 | (charCode >> 6), 0x80 | (charCode & 0x3f));
    } else if (charCode < 0xd800 || charCode >= 0xe000) {
      utf8.push(0xe0 | (charCode >> 12), 0x80 | ((charCode >> 6) & 0x3f), 0x80 | (charCode & 0x3f));
    } else {
      ii++;
      // Surrogate pair:
      // UTF-16 encodes 0x10000-0x10FFFF by subtracting 0x10000 and
      // splitting the 20 bits of 0x0-0xFFFFF into two halves
      charCode = 0x10000 + (((charCode & 0x3ff) << 10) | (str.charCodeAt(ii) & 0x3ff));
      utf8.push(
        0xf0 | (charCode >> 18),
        0x80 | ((charCode >> 12) & 0x3f),
        0x80 | ((charCode >> 6) & 0x3f),
        0x80 | (charCode & 0x3f),
      );
    }
  }
  return utf8;
}

console.log(strToUtf8Bytes("\u00A0"));

But this gives the following (an array of unsigned values, like those held in a https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint8Array):

[194, 160]

This is a problem for me because I'm using the Graal JS engine and need to pass the array to a Java function that expects a byte[]; any value in the array greater than 127 will cause an error, as described here:

https://github.com/oracle/graal/issues/2118
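
To show concretely what I mean (a minimal sketch: the ByteSink class and its accept method are hypothetical placeholders for my real Java code, and Java interop is assumed to be enabled, e.g. by running with --jvm):

// Hypothetical Java helper, standing in for the real method I need to call:
//
//   package com.example;
//   public class ByteSink {
//       public static void accept(byte[] data) { /* ... */ }
//   }

const ByteSink = Java.type("com.example.ByteSink");

// Unsigned UTF-8 values: 194 does not fit into a signed Java byte,
// so Graal's host interop rejects the conversion and throws.
ByteSink.accept([194, 160]);

// The signed equivalents fit, matching Java's getBytes() output.
ByteSink.accept([-62, -96]);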

Note that I also tried the TextEncoder class instead of the strToUtf8Bytes function, as described here:

java string.getBytes("UTF-8") javascript equivalent

but it gives the same result as above.
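
For reference, this is roughly what the TextEncoder attempt looked like (my own minimal reproduction, not code quoted from that answer):

// TextEncoder always encodes to UTF-8 and returns a Uint8Array,
// so every element is in the unsigned range 0..255.
const encoded = new TextEncoder().encode("\u00A0");
console.log(encoded); // Uint8Array [ 194, 160 ] – unsigned again, same values as above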

Is there something else I can try here so that I can get JavaScript to generate the same array as Java?


Solution

  • The result is the same in terms of bytes; JavaScript just defaults to unsigned bytes. The U in Uint8Array stands for “unsigned”; the signed variant is called Int8Array.

    The conversion is easy: just pass the result to the Int8Array constructor:

    console.log(new Int8Array(new TextEncoder().encode("\u00a0"))); // Int8Array [ -62, -96 ]
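
    If the Java side still needs a plain array of signed numbers rather than a typed array, one option (a sketch under that assumption, not something taken from the Graal documentation) is to copy the values out of the Int8Array:

    // Encode to UTF-8 first: unsigned values in 0..255.
    const unsigned = new TextEncoder().encode("\u00A0"); // Uint8Array [ 194, 160 ]

    // Reinterpret the same byte values as signed -128..127.
    const signed = new Int8Array(unsigned); // Int8Array [ -62, -96 ]

    // Copy into a plain JavaScript array if the typed array itself is not accepted.
    const bytes = Array.from(signed); // [ -62, -96 ]

    Values of 128 and above wrap into the negative range because Int8Array stores two's-complement bytes, which is exactly how Java's byte behaves, so the result matches getBytes(StandardCharsets.UTF_8).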