I am working with a text file that contains lots of garbled strings such as:
VyplÅ<88>te prosÃm pole "jméno
My editor says that the file encoding is latin1. The string is supposed to be a Czech sentence containing diacritics, so it is no wonder that it is displayed wrong. I have tried to force the utf8 and latin2 encodings in my editor, but that did not help. I have also tried to use iconv to convert the file from latin1 to utf8 or latin2, but that did not help either. I encounter issues like this quite often and I don't know any solution other than manually rewriting the strings. Is there a better way to fix this?
EDIT:
Here is the original sentence:
Vyplňte prosím pole "jméno"
Here is a hex dump of the part where the malformed string occurs:
0002640: 6a6d 656e 6f22 5d20 3d20 2744 453a 2056 jmeno"] = 'DE: V
0002650: 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20 ypl..te pros..m
0002660: 706f 6c65 2022 6a6d c3a9 6e6f 222e 273b pole "jm..no".';
EDIT2:
The sentence above really is correct utf8, as deceze has said. But I have just found something strange. If I try to transcode the file from utf8 to utf8 (with iconv), I get an error on the word Postgebühr, at the character ü. If I look at the hex dump, this character is represented as \xfc (252 in decimal), which is the valid latin1 byte encoding for ü but a completely invalid utf8 byte sequence. It seems that part of the file is in latin1 and another part is in utf8. Here is the part of the file that is (probably) in latin1:
0000250: 506f 7374 6765 62fc 6872 273b 0a09 0963 Postgeb.hr';...c
0000260: 6f6e 665b 2277 6166 6572 7322 5d20 3d20 onf["wafers"] =
0000270: 2744 453a 206f 706c c3a1 746b 20c3 273b 'DE: opl..tk .';
Looking into this more, it does not even seem to be valid latin1, because even in latin1 it is garbled (DE: oplátk Ã instead of, probably, DE: oplatky za). This part of the file seems to contain some damaged text.
I can't understand how the encoding in this file could have gotten mixed up like that. Any ideas?
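One way to locate the mixed-up parts is to test every line separately for UTF-8 validity. The following is only a minimal sketch in Python: the function name find_non_utf8_lines and the sample bytes (modelled on the hex dumps above) are made up for illustration, and it assumes the file is line-oriented so that each line can be judged on its own.

```python
def find_non_utf8_lines(data: bytes):
    """Return (line_number, raw_line) for every line that is not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append((number, line))
    return bad


# Hypothetical sample modelled on the hex dumps in the question.
sample = (
    b"Vypl\xc5\x88te pros\xc3\xadm pole\n"  # valid UTF-8
    b"Postgeb\xfchr';\n"                     # lone 0xFC: Latin-1 style byte
    b"'DE: opl\xc3\xa1tk \xc3';\n"           # 0xC3 with no continuation byte
)

for number, line in find_non_utf8_lines(sample):
    print(number, line)
```

Lines reported this way can then be inspected by hand or decoded with a different encoding. A line like the last one, where a multi-byte sequence is cut off, is damaged rather than merely in another encoding, so no re-decoding will recover it.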
If the file is supposed to contain Latin2 encoded text, then trying to convert it from Latin1 or similar will of course mess things up.
The problem is simply that your text editor cannot automagically recognize the encoding, since the single-byte Latin* encodings are indistinguishable at the byte level: any byte sequence is equally valid in each of them. If your editor "tells" you the encoding is Latin1, what it means is that it is currently interpreting the file as Latin1. Obviously it has that wrong.
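To illustrate why a single-byte encoding cannot be detected from the bytes alone, here is a small Python sketch. It assumes, purely for the sake of the example, that the Czech text had been stored as Latin2 (ISO-8859-2); the variable names are made up.

```python
# Assumption for illustration: the text was saved as Latin2 (ISO-8859-2).
raw = "Vyplňte prosím".encode("iso-8859-2")

# Every byte is valid in both encodings, so neither decode raises an error.
as_latin2 = raw.decode("iso-8859-2")
as_latin1 = raw.decode("iso-8859-1")

print(raw)        # b'Vypl\xf2te pros\xedm'
print(as_latin2)  # Vyplňte prosím
print(as_latin1)  # Vyplòte prosím
```

Both decodes succeed; only a human (or a heuristic) can tell that the second result is not Czech. That is all an editor can go on when it guesses Latin1.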
You either need to tell your editor to treat the file as Latin2 (Open As... Latin2, or however your editor gives you this choice) or to convert the file from Latin2 into an encoding your editor handles correctly.
To understand encodings better, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
In response to your posted hex dump: That file is UTF-8 encoded.
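This can be checked by decoding the bytes from the dump directly. The sketch below uses Python and copies the hex digits for the sentence from the dump in the question (starting at the V and ending at the closing quote); the variable names are only illustrative.

```python
# Bytes of the sentence, copied from the hex dump in the question.
raw = bytes.fromhex(
    "56 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20"
    "706f 6c65 2022 6a6d c3a9 6e6f 22"
)

# Interpreted as UTF-8, the two-byte sequences c5 88, c3 ad and c3 a9
# become single characters.
as_utf8 = raw.decode("utf-8")
print(as_utf8)  # Vyplňte prosím pole "jméno"

# Interpreted as Latin1, every byte becomes its own character, which is
# the garbled form the editor displays.
as_latin1 = raw.decode("iso-8859-1")
print(repr(as_latin1))
```

The Latin1 reading turns c5 88 into Å followed by the control character 0x88, which matches the VyplÅ<88>te shown in the editor. So the bytes are fine and only the interpretation is wrong.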