My question is simple. Start with a character outside the Basic Multilingual Plane (BMP), say
var original = "🎮"
or equivalently
var original = `\u{1f3ae}`
JavaScript stores this string in memory as UTF-16. Unfortunately, you hand the string to some database/application (specifics irrelevant) that misinterprets the UTF-16 bytes as UTF-8, so what you read back out of the database/application is precisely
var switchedEncoding = Buffer.from(original, 'utf16le').toString('utf8')
If you log switchedEncoding, in this case you get <خ�. Not good. Okay, so you try to switch it back:
var switchedBack = Buffer.from(switchedEncoding,'utf8').toString('utf16le')
If you log switchedBack, in this case you get �붿, not 🎮. Bummer.
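To see where things go wrong, it may help to trace the bytes through both steps. The following is a minimal sketch using only Node's built-in Buffer; the variable names utf16Bytes and reencoded are introduced here for illustration and are not part of the original snippets.

```javascript
// Trace the bytes of the string through the wrong decoding and back.
const original = "\u{1f3ae}";

// The UTF-16LE bytes of the surrogate pair D83C DFAE.
const utf16Bytes = Buffer.from(original, "utf16le");
console.log(utf16Bytes.toString("hex")); // 3cd8aedf

// Decoding those bytes as UTF-8: 3c is "<", d8 ae is "خ", and the lone
// lead byte df is invalid, so it is replaced with U+FFFD.
const switchedEncoding = utf16Bytes.toString("utf8");

// Re-encoding as UTF-8 writes U+FFFD as ef bf bd; the byte df is gone.
const reencoded = Buffer.from(switchedEncoding, "utf8");
console.log(reencoded.toString("hex")); // 3cd8aeefbfbd
```

The last byte of the original sequence has been replaced by the three-byte encoding of the replacement character, so decoding the result as UTF-16LE cannot give back the original string.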
On the other hand, if your original string is in the BMP, switchedBack can recover the original just fine. My question is whether information is irreversibly lost by the incorrect decoding done by the application/database. If not, I would like a clever function that can invert it even for characters in the astral planes.
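One way to probe this empirically is to compare bytes before and after the wrong decoding. The sketch below assumes the application behaves exactly like Buffer's own UTF-8 decoder, which may not hold for a real database; survivesMisdecoding is a hypothetical helper name, not an existing API.

```javascript
// Hypothetical helper: returns true if the UTF-16LE bytes of `str` come back
// unchanged after being decoded as UTF-8 and re-encoded as UTF-8.
function survivesMisdecoding(str) {
  const utf16Bytes = Buffer.from(str, "utf16le");
  const roundTripped = Buffer.from(utf16Bytes.toString("utf8"), "utf8");
  return roundTripped.equals(utf16Bytes);
}

console.log(survivesMisdecoding("a"));         // true  (bytes 61 00 are valid UTF-8)
console.log(survivesMisdecoding("\u{1f3ae}")); // false (byte df becomes U+FFFD)
console.log(survivesMisdecoding("\u00e9"));    // false (byte e9 becomes U+FFFD)
```

Under that assumption, the result depends on whether the UTF-16LE bytes happen to form valid UTF-8 rather than on the plane as such, which is why a BMP character like é (U+00E9) can fail the check too.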
Thanks for your help!
The answer was as follows. I could get the database (a LevelDB) to read things out into a Buffer, and then I used the following approach with the iconv package in Node:
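const Iconv = require("iconv").Iconv;
// The constructor arguments are (fromEncoding, toEncoding).
let iconv = new Iconv("UTF-8", "UTF-16LE");
let iconv2 = new Iconv("UTF-16LE", "UTF-8");
let original = "\u{1f3ae}";
// convert() returns a Buffer; here it holds the UTF-16LE bytes of original.
let switched = iconv.convert(original);
// Converting the untouched bytes back to UTF-8 restores the character.
let switchedBack = iconv2.convert(switched);
console.log(original);
console.log(switched.toString()); // UTF-16LE bytes read as UTF-8: garbled
console.log(switchedBack.toString()); // the original character
// So it's switched.toString() that is not recoverable;
// switched itself (a Buffer) is.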
Good to know that Buffer.toString('someEncoding') is not always invertible if the encoding of the bytes in the buffer isn't someEncoding.
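The same point can be illustrated without iconv, as a rough sketch using only Node's built-in Buffer. Here rawFromDb is a hypothetical stand-in for the bytes the database returns; how you actually obtain a Buffer depends on your database client.

```javascript
// Hypothetical stand-in for the raw bytes handed back by the database.
const original = "\u{1f3ae}";
const rawFromDb = Buffer.from(original, "utf16le");

// Decoding the untouched bytes with the correct encoding recovers the string.
const recovered = rawFromDb.toString("utf16le");
console.log(recovered === original); // true

// Going through a UTF-8 string first swaps the invalid byte for U+FFFD,
// so the bytes you get back no longer match the ones you started with.
const viaString = Buffer.from(rawFromDb.toString("utf8"), "utf8");
console.log(viaString.equals(rawFromDb)); // false
```

As long as the bytes stay in a Buffer, nothing is lost; the damage happens only at the moment they are turned into a string with the wrong encoding.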