ios, swift, xcode, foundation

Do String.Encoding.utf16 and String.Encoding.utf16BigEndian mean the same thing, i.e. UTF-16 big endian?


I have the bytes of a string encoded as UTF-16 big endian. I read these bytes from a file shared by a colleague, who confirms that the string is UTF-16 big endian.

For demo purposes, I read the file and interpret the string. The code is below:

let bundle = Bundle(for: ViewController.self)
guard let url = bundle.url(forResource: "TestBingEndian", withExtension: "txt") else { return }
let data = try! Data(contentsOf: url)
print(data)

let bigEndianString = String(bytes: data, encoding: .utf16BigEndian)
print("bigEndianString: \(bigEndianString!)")

let littleEndian = String(bytes: data, encoding: .utf16LittleEndian)
print("littleEndian: \(littleEndian!)")

let endiannessNotSpecifiedString = String(bytes: data, encoding: .utf16)
print("endiannessNotSpecifiedString: \(endiannessNotSpecifiedString!)")

The output for the bigEndianString is what was expected.

The output for littleEndian was garbage in my case, so it was not useful.

The output for endiannessNotSpecifiedString was also as expected and matched with the bigEndianString.

So my question is, are .utf16 and .utf16BigEndian the same thing?

PS: My machine is little endian. I thought .utf16 would follow my machine's endianness, but according to my tests it behaves as big endian.


Solution

  • So my question is, are .utf16 and .utf16BigEndian the same thing?

    No. Data meant to be read as plain .utf16 should start with a byte order mark (BOM) at the top of the file, which tells the decoder which byte order follows.

    let str = "Hello, World!"
    
    let dataUTF16 = str.data(using: .utf16)!
    print(dataUTF16 as NSData)
    
    let dataUTF16BE = str.data(using: .utf16BigEndian)!
    print(dataUTF16BE as NSData)
    
    let dataUTF16LE = str.data(using: .utf16LittleEndian)!
    print(dataUTF16LE as NSData)
    

    Output:

    <fffe4800 65006c00 6c006f00 2c002000 57006f00 72006c00 64002100>
    <00480065 006c006c 006f002c 00200057 006f0072 006c0064 0021>
    <48006500 6c006c00 6f002c00 20005700 6f007200 6c006400 2100>
    

    The leading 0xff, 0xfe in the first line is the BOM in little endian. In big endian, it would be 0xfe, 0xff.

    With .utf16 you can read UTF-16 data that carries a correct BOM, even on a platform whose native endianness differs from that of the data.

    Add print(data as NSData) and check the first two bytes of your data. My guess is that they are 0xfe, 0xff (the BOM in big endian).
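
    The same check can be done in code instead of by reading the hex dump. This is only a minimal sketch: UTF16ByteOrderMark and detectUTF16BOM(in:) are hypothetical names made up for illustration, not Foundation API.

    ```swift
    import Foundation

    /// Byte order announced by a UTF-16 byte order mark (hypothetical type).
    enum UTF16ByteOrderMark {
        case bigEndian     // data starts with 0xFE 0xFF
        case littleEndian  // data starts with 0xFF 0xFE
    }

    /// Hypothetical helper: looks only at the first two bytes of `data`.
    /// Returns nil when the data is too short or does not start with a BOM.
    func detectUTF16BOM(in data: Data) -> UTF16ByteOrderMark? {
        let prefix = [UInt8](data.prefix(2))
        if prefix == [0xFE, 0xFF] { return .bigEndian }
        if prefix == [0xFF, 0xFE] { return .littleEndian }
        return nil
    }

    let withBOM = Data([0xFF, 0xFE, 0x48, 0x00])     // little-endian BOM, then "H"
    let withoutBOM = Data([0x00, 0x48, 0x00, 0x69])  // "Hi" in big endian, no BOM

    let first = detectUTF16BOM(in: withBOM)      // .littleEndian
    let second = detectUTF16BOM(in: withoutBOM)  // nil
    ```

    A nil result only says that no BOM was found; it says nothing about which byte order the data actually uses.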


    It seems my guess was wrong: when no BOM is found, .utf16 in Apple's Foundation assumes big endian rather than the platform's native byte order. (I first suspected a historical reason, since Apple used to ship big-endian platforms such as 68k and PowerPC. As Martin R's comment points out, this behavior is defined in The Unicode Standard. It seems I need to refresh my knowledge.)

    Still, it is better to specify .utf16BigEndian when you know your data is big endian and contains no BOM, and to use .utf16 for data that contains a correct BOM.

    let str = "Hello, World!"
    
    let dataUTF16 = str.data(using: .utf16)!
    print(dataUTF16 as NSData)
    
    let strUTF16asUTF16 = String(data: dataUTF16, encoding: .utf16)
    debugPrint(strUTF16asUTF16) //->Optional("Hello, World!")
    let strUTF16asUTF16BE = String(data: dataUTF16, encoding: .utf16BigEndian)
    debugPrint(strUTF16asUTF16BE) //->Optional("䠀攀氀氀漀Ⰰ 圀漀爀氀搀℀")
    let strUTF16asUTF16LE = String(data: dataUTF16, encoding: .utf16LittleEndian)
    debugPrint(strUTF16asUTF16LE) //->Optional("Hello, World!")
    

    When almost all the characters are ASCII, guessing the endianness from the bytes can work, but when most of them are non-ASCII, such guesses may be wrong. This caveat only applies if you are guessing the endianness.
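
    To make the point about guessing concrete, here is a rough sketch of one such heuristic. It is an assumption chosen for illustration, not something Foundation does: for mostly-ASCII text, one byte of every UTF-16 code unit is 0x00, so the position of the zero bytes hints at the byte order. GuessedByteOrder and guessUTF16ByteOrder(_:) are hypothetical names.

    ```swift
    /// Result of the heuristic below (hypothetical type).
    enum GuessedByteOrder {
        case bigEndian
        case littleEndian
    }

    /// Hypothetical heuristic: counts zero bytes at even and odd offsets.
    /// ASCII text in big endian has its zero bytes at even offsets,
    /// in little endian at odd offsets. Returns nil on a tie.
    func guessUTF16ByteOrder(_ bytes: [UInt8]) -> GuessedByteOrder? {
        var zerosAtEven = 0
        var zerosAtOdd = 0
        for (offset, byte) in bytes.enumerated() where byte == 0 {
            if offset % 2 == 0 {
                zerosAtEven += 1
            } else {
                zerosAtOdd += 1
            }
        }
        if zerosAtEven == zerosAtOdd { return nil }
        return zerosAtEven > zerosAtOdd ? .bigEndian : .littleEndian
    }

    let asciiBE: [UInt8] = [0x00, 0x48, 0x00, 0x69]  // "Hi" in big endian
    let asciiLE: [UInt8] = [0x48, 0x00, 0x69, 0x00]  // "Hi" in little endian
    let cjkBE: [UInt8] = [0x65, 0x00, 0x6C, 0x00]    // "攀氀" in big endian

    let guessBE = guessUTF16ByteOrder(asciiBE)   // .bigEndian (correct)
    let guessLE = guessUTF16ByteOrder(asciiLE)   // .littleEndian (correct)
    let guessCJK = guessUTF16ByteOrder(cjkBE)    // .littleEndian (wrong for this data)
    ```

    The last case shows the failure mode: the big-endian bytes for "攀氀" are identical to the little-endian bytes for "el", so the heuristic picks the wrong byte order.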

    In general, though, you should follow the Unicode Standard, which states that if no BOM is found, the bytes should be treated as big endian.
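
    That rule can be written down explicitly rather than left to the decoder's default. The following is a minimal sketch under that assumption; decodeUTF16(_:) is a hypothetical helper name, and it uses only String(data:encoding:) with the encodings discussed above.

    ```swift
    import Foundation

    /// Hypothetical helper: trusts a BOM when one is present,
    /// and otherwise treats the bytes as big endian.
    func decodeUTF16(_ data: Data) -> String? {
        let prefix = [UInt8](data.prefix(2))
        let hasBOM = prefix == [0xFE, 0xFF] || prefix == [0xFF, 0xFE]
        // With a BOM, .utf16 picks the byte order from it.
        // Without one, state the byte order explicitly.
        return String(data: data, encoding: hasBOM ? .utf16 : .utf16BigEndian)
    }

    let noBOM = Data([0x00, 0x48, 0x00, 0x69])                  // "Hi", big endian
    let littleWithBOM = Data([0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00])
    let bigWithBOM = Data([0xFE, 0xFF, 0x00, 0x48, 0x00, 0x69])

    let a = decodeUTF16(noBOM)          // Optional("Hi")
    let b = decodeUTF16(littleWithBOM)  // Optional("Hi")
    let c = decodeUTF16(bigWithBOM)     // Optional("Hi")
    ```

    Spelling out the fallback keeps the intent visible to whoever reads the code later, even though .utf16 alone gave the same result in the question's test.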