Tags: swift, unicode, utf-8, utf-16

Does UTF-16 encoding handle data compression by default?


I have the Unicode character த. When I convert it to data, I get:

UTF 8 -> Size: 3 bytes Array: [224, 174, 164]

UTF 16 -> Size: 4 bytes Array: [2980]

Seems pretty simple: UTF-8 takes 3 bytes for this code point and UTF-16 takes 4 bytes. But if I encode "தததத" in Swift on macOS,

let tamil = "தததத"
         
let utf8Data = tamil.data(using: .utf8)!
let utf16Data = tamil.data(using: .utf16)!

print("UTF 8 -> Size: \(utf8Data.count) bytes Array: \(tamil.utf8.map({$0}))")
print("UTF 16 -> Size: \(utf16Data.count) bytes Array: \(tamil.utf16.map({$0}))")

Then the output is

UTF 8 -> Size: 12 bytes Array: [224, 174, 164, 224, 174, 164, 224, 174, 164, 224, 174, 164]

UTF 16 -> Size: 10 bytes Array: [2980, 2980, 2980, 2980]

Since one character took 4 bytes, I expected the UTF-16 data for "தததத" to be 4 × 4 = 16 bytes. But it is only 10 bytes, and the array still has 4 code units. Why is that? Where did the other 6 bytes go?


Solution

  • The actual byte representation of those strings is this:

    UTF-8:

    e0ae a4e0 aea4 e0ae a4e0 aea4
    

    UTF-16:

    feff 0ba4 0ba4 0ba4 0ba4
    

    The UTF-8 representation is e0aea4 times four.
    The UTF-16 representation is 0ba4 times four plus one leading BOM feff.

    UTF-16 text should start with a BOM, but it is only required once at the start of the string, not once per character. That explains the sizes: a single character measured 2 (BOM) + 2 = 4 bytes, while four characters measure 2 (BOM) + 4 × 2 = 10 bytes rather than 16.
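
    If you want to verify this yourself, here is a minimal sketch (assuming Foundation; the hex(_:) helper is just for illustration) that hex-dumps each encoding. The explicit .utf16BigEndian and .utf16LittleEndian encodings are also shown: they write no BOM, so four characters come out as exactly 8 bytes.

    import Foundation

    let tamil = "தததத"

    // Helper (illustrative): render a Data value as space-separated hex bytes.
    func hex(_ data: Data) -> String {
        data.map { String(format: "%02x", $0) }.joined(separator: " ")
    }

    // UTF-8: e0 ae a4 repeated four times -> 12 bytes, no BOM.
    print(hex(tamil.data(using: .utf8)!))

    // UTF-16: a 2-byte BOM followed by the 2-byte code unit 0ba4 four times -> 10 bytes.
    // (The byte order of the BOM and code units can differ by platform.)
    print(hex(tamil.data(using: .utf16)!))

    // Explicit byte-order encodings omit the BOM -> 4 x 2 = 8 bytes.
    print(hex(tamil.data(using: .utf16BigEndian)!))
    print(hex(tamil.data(using: .utf16LittleEndian)!))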