
Inverting UTF16 to UTF8 conversion for astral characters


My question is simple. Start with a character that is not in the Basic Multilingual Plane (BMP), say var original = "🎮" or equivalently

var original=`\u{1f3ae}`

JavaScript stores this string in memory as UTF-16. Unfortunately, you hand the string to some database or application (specifics irrelevant) that misinterprets the UTF-16 bytes as UTF-8, so what you read back out is precisely

var switchedEncoding = Buffer.from(original, 'utf16le').toString('utf8')

If you log switchedEncoding in this case you get <خ�. Not good. Okay, so you try to switch it back:

var switchedBack = Buffer.from(switchedEncoding,'utf8').toString('utf16le')

If you log switchedBack in this case you get �붿 not 🎮. Bummer.
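
To see what each step does at the byte level, the buffers can be dumped as hex. This is only a minimal sketch using Node's built-in Buffer; the names utf16Bytes and reencoded are introduced here for illustration and are not part of the original snippets.

```javascript
// Inspect the bytes behind each conversion step.
const original = "\u{1f3ae}"; // 🎮, a surrogate pair (D83C DFAE) in UTF-16

// The bytes JavaScript's UTF-16 representation corresponds to (little-endian).
const utf16Bytes = Buffer.from(original, "utf16le");
console.log(utf16Bytes.toString("hex")); // 3cd8aedf

// Misread those bytes as UTF-8, then encode the resulting string as UTF-8 again.
const switchedEncoding = utf16Bytes.toString("utf8");
const reencoded = Buffer.from(switchedEncoding, "utf8");
console.log(reencoded.toString("hex")); // 3cd8aeefbfbd

// The byte sequences differ, so decoding reencoded as UTF-16LE cannot give 🎮.
console.log(utf16Bytes.equals(reencoded)); // false
```

The final byte df is a UTF-8 lead byte with nothing following it, so the decoder substitutes U+FFFD, which re-encodes as ef bf bd. The round trip can only succeed when the UTF-16LE bytes happen to form valid UTF-8, as they do for ASCII text.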

On the other hand, if your original string is in the BMP (e.g. plain ASCII), switchedBack can recover the original just fine. My question is: is information irreversibly lost by the incorrect decoding done by the application/database? If not, I would like a clever function that can invert it even for characters in the astral planes.

Thanks for your help!


Solution

  • The answer turned out to be as follows. I could get the database (a LevelDB) to read things out into a Buffer, and then I used the following approach with the iconv package in Node:

    const Iconv = require("iconv").Iconv;
    // Arguments are (fromEncoding, toEncoding).
    const iconv = new Iconv("UTF-8", "UTF-16LE");
    const iconv2 = new Iconv("UTF-16LE", "UTF-8");
    const original = "\u{1f3ae}";
    const switched = iconv.convert(original);      // Buffer of UTF-16LE bytes
    const switchedBack = iconv2.convert(switched); // Buffer of UTF-8 bytes
    console.log(original);
    console.log(switched.toString());     // UTF-16LE bytes decoded as UTF-8: garbled
    console.log(switchedBack.toString()); // the original character again

    // So it's switched.toString() (a string) that is not recoverable;
    // switched itself (a Buffer) is.
    

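    Good to know: Buffer.toString('someEncoding') is not always invertible when the bytes in the buffer are not actually encoded as someEncoding. With 'utf8', for example, byte sequences that are not valid UTF-8 are replaced with U+FFFD, and the original bytes cannot be reconstructed from the resulting string.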
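
The same recovery can probably be done without iconv, as long as the raw bytes are available. The sketch below uses only Node's built-in Buffer and assumes the stored bytes are UTF-16LE; rawBytes is a hypothetical stand-in for whatever Buffer the database returns.

```javascript
// Sketch: recover the text from raw bytes with Node's built-in Buffer only.
const original = "\u{1f3ae}"; // 🎮

// Hypothetical stand-in for the Buffer read back from the database
// (assumption: the bytes were written as UTF-16LE).
const rawBytes = Buffer.from(original, "utf16le");

// Lossy path: decoding as UTF-8 replaces invalid sequences with U+FFFD.
const lossy = rawBytes.toString("utf8").includes("\uFFFD");

// Lossless path: decode the bytes with the encoding they were written in.
const recovered = rawBytes.toString("utf16le");

// If the bytes must travel as a string, 'latin1' maps every byte to exactly
// one code unit, so this particular round trip preserves the bytes.
const carrier = rawBytes.toString("latin1");
const restored = Buffer.from(carrier, "latin1");

console.log(lossy);                     // true
console.log(recovered === original);    // true
console.log(restored.equals(rawBytes)); // true
```

The design point is to keep the data as a Buffer until the correct encoding is known, and to decode exactly once with that encoding.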