Now I'm familiar with Unicode and UTF character encodings. I also know about endianness: an architecture is either little-endian or big-endian, and the distinction exists for performance reasons in low-level hardware. But why do we need endianness in text files? Characters in a file are stored from left to right regardless of endianness, so it seems to me the proper choice here is big-endian. I'd go further: we shouldn't even have to talk about endianness when saving characters to a text file. So my question is: why isn't there just one UTF-16 and one UTF-32? Can someone give me an example where it is necessary to have both UTF-16LE and UTF-16BE / UTF-32LE and UTF-32BE?
For the sake of argument, let us entertain this notion. You define valid UTF-16 as being big-endian. OK, fine.
I am writing code on a machine that is little-endian. I still need to be able to read, understand, and manipulate UTF-16 data. Because I'm using a little-endian processor (using C++ as an example language), `char16_t` is little-endian. If I were to `bit_cast` that into an array of two characters, the first byte would be the least-significant byte.
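As a concrete illustration (a minimal sketch, assuming C++20 for `std::bit_cast`; not part of the original answer), inspecting the object representation of a `char16_t` shows the native byte order directly:

```cpp
#include <array>
#include <bit>
#include <cstdio>

int main() {
    char16_t c = u'A';  // code unit U+0041
    // View the code unit's object representation as raw bytes.
    auto bytes = std::bit_cast<std::array<unsigned char, 2>>(c);
    // On a little-endian machine this prints "41 00";
    // on a big-endian machine it prints "00 41".
    std::printf("%02x %02x\n",
                static_cast<unsigned>(bytes[0]),
                static_cast<unsigned>(bytes[1]));
}
```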
So while your interchange format specifies big-endian as the only valid transmission format, within my machine it isn't useful UTF-16 to me until it has been converted to little-endian, where my machine can actually understand the values stored in it. So when I read character data from a valid UTF-16 stream (using your definition of validity), I have to byte-swap it before I can make sense of the data.
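In code, that read-side swap might look something like the following sketch (`utf16be_to_native` is a hypothetical helper name, not something from the answer):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: assemble native char16_t code units from a stream of
// UTF-16BE bytes. On a little-endian host this is exactly the byte swap the
// answer describes; surrogate pairs pass through unchanged as two code units.
std::vector<char16_t> utf16be_to_native(const std::vector<std::uint8_t>& bytes) {
    std::vector<char16_t> out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        // Big-endian wire order: high byte first, then low byte.
        out.push_back(static_cast<char16_t>((bytes[i] << 8) | bytes[i + 1]));
    }
    return out;
}
```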
Now, let's say I want to send UTF-16 via some transmission mechanism (files, the internet, etc.) to another program or machine. But for whatever reason, I know that the receiving process is definitely going to be running on a little-endian machine.
In order to do this in a way that is valid for your idea of how UTF-16 should be transmitted, I now must do a byte-swap of each UTF-16 code unit, transmit the swapped data, and then byte-swap it at the destination before it can be understood.
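The outbound direction is the mirror image. A sketch of what that costs on a little-endian host (`to_utf16be` is a hypothetical helper, not something from the answer):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize native char16_t code units as UTF-16BE bytes
// for transmission. On a little-endian host every code unit gets swapped here
// and swapped back by the receiver; emitting UTF-16LE instead would be a
// straight copy of the in-memory representation.
std::vector<std::uint8_t> to_utf16be(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t c : s) {
        out.push_back(static_cast<std::uint8_t>(c >> 8));    // high byte first
        out.push_back(static_cast<std::uint8_t>(c & 0xFF));  // then low byte
    }
    return out;
}
```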
The practical reality of the matter is this: I am not going to do that. There is absolutely zero benefit in me doing that. And most important of all... you cannot make me do that.
The reality is this: so long as little-endian machines exist and are fairly widespread, there will be some practical utility for at least some applications to store/send/receive data in their native UTF-16LE format. And so long as there is practical utility in doing a thing, working programmers will do it. You can tell them that they're doing UTF-16 transmission wrong all you like, but they will continue to do it.
So your choices are to make rules that you know won't be followed, or make rules that accept that other people have different ideas about how things should be.
Note that this question is different from that of a more rigid data format. There are binary data formats that are explicitly little-endian or big-endian. But generally, such formats tend to be strongly specified and have to conform to a strict set of other criteria. There will often be a conformance-testing application that you can use to make sure your program is generating the file correctly, and writing it with the wrong endianness will immediately be seen as "incorrect".
Plain text just doesn't work that way. Nobody shoves their text files through some recognizer, not unless the text itself is expected to conform to a specific format (at which point, it's not "plain text" anymore). For example, XML could have required that UTF-16-encoded text files use a specific endianness. But plain text is too simplistic for that; there are too many applications that just want to dump a UTF-16 string to a file for that to be realistic.