encoding, utf-8, character-encoding, latin1

Understanding file encodings


In Eclipse, I have a file that contains this line somewhere:

onclick='obj1.help_open_new_window(fn1(), "/redir/url_name")'

and in Eclipse, under Edit menu -> Set Encoding, I see this:

[screenshot of the Set Encoding dialog, not preserved]

Now I change the encoding to UTF-8 using the same dialog box and the text changes to:

onclick='obj1.help_open_new_window(fn1(),�"/redir/url_name")'

All I know is that if this were not happening, my website would work fine. Why is this happening, and what do I do to prevent it?

I do have some knowledge about encodings (I have read "Â and nbsp mystery explained" and "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"), but I still do not understand why this is happening. Feel free to go down to the byte level (how the file is stored) to explain it.

UPDATE: Here's what I understand: if the file is encoded in latin-1, then every character is a single byte, and so is the space. It should be byte 32 (hex 0x20). Now when I convert it to UTF-8, it still remains 0x20, and that is definitely a space. This leads me to believe that in latin-1 the space is not 0x20 but a combination of two bytes. How is that possible?


Solution

  • The character you have between the comma and the quote appears not to be a normal space but some other whitespace character, probably the famous U+00A0 NO-BREAK SPACE. Since the file is encoded in latin1, the character is stored on disk as the single byte \xA0, which on its own does not form a valid character in UTF-8. This means that if you reload the file in your editor using UTF-8, you will see the Unicode replacement character (U+FFFD, �) in its stead. (The proper UTF-8 encoding of no-break space would be the two bytes \xC2\xA0.)

    To get rid of the problem, replace the no-break space with a normal space (U+0020). There is no reason to use a no-break space in this context, i.e. in program text.
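The byte-level behaviour described above can be reproduced with a short Python sketch. It assumes, as the answer does, that the odd character is U+00A0. The variable names and the `replace_nbsp` helper are made up for illustration and are not part of Eclipse or any library.

```python
# The line from the question, assuming the character after the comma
# is U+00A0 NO-BREAK SPACE (an assumption, as in the answer above).
line = 'onclick=\'obj1.help_open_new_window(fn1(),\u00a0"/redir/url_name")\''

# A normal space is the same single byte in both encodings.
print(" ".encode("latin-1").hex())       # 20
print(" ".encode("utf-8").hex())         # 20

# The no-break space is one byte in Latin-1 but two bytes in UTF-8.
print("\u00a0".encode("latin-1").hex())  # a0
print("\u00a0".encode("utf-8").hex())    # c2a0

latin1_bytes = line.encode("latin-1")    # how the file is stored on disk
utf8_bytes = line.encode("utf-8")        # how it would be stored as UTF-8

# Reinterpreting the Latin-1 bytes as UTF-8: the lone byte 0xA0 is not a
# valid UTF-8 sequence, so a lenient decoder substitutes U+FFFD for it.
misread = latin1_bytes.decode("utf-8", errors="replace")
print("\ufffd" in misread)               # True


def replace_nbsp(text: str) -> str:
    """Hypothetical helper: swap every U+00A0 for a normal space (U+0020)."""
    return text.replace("\u00a0", " ")


fixed = replace_nbsp(line)

# The fixed line is pure ASCII, so both encodings give identical bytes
# and switching the editor's encoding no longer changes anything.
print(fixed.encode("latin-1") == fixed.encode("utf-8"))  # True
```

This also answers the question in the update: the normal space really is 0x20 in both encodings. The character in the file is a different one that merely looks like a space.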