
Avoid double conversion when converting windows-1250 to utf8


Possible Duplicate:
How do I convert files between encodings where only some of them are wrong?

I use the following command to convert all .srt files in a folder from windows-1250 to utf-8:

for /f "delims=" %%a IN (' dir C:\utf_check\*.srt /b /s ') do %iconv% -s -f windows-1250 -t utf-8 < "%%a" > "%%a.txt"

The problem is that if a file is already utf-8, iconv breaks it by inserting strange characters. Is there a way to first detect whether a file is utf-8 or ASCII, and convert it only if it is not? I tried flip, enca, enconv, and recode with no success.

I use Windows 2003 Server and I have Cygwin installed too, in case that helps.

Example: this is the text in a utf-8 file: Aşezaţi-vă. And this is the text after iconv converts it again: AĹźezaĹŁi-vÄ.


Solution

  • Sensible windows-1250 text that contains non-ASCII characters will practically never be valid utf-8, because the bytes that represent characters beyond the ASCII range in utf-8 correspond to sequences of windows-1250 characters that make no sense. So you need to check first whether the file is valid utf-8, and do the conversion only if it is not.

    You can use the fact that iconv fails (with errorlevel 1) if it can't do the conversion. So first run iconv -f utf-8 -t utf-8, and if it fails, run iconv -f windows-1250 -t utf-8.

    Note that this only decides whether something is utf-8 or a legacy encoding. It cannot tell the various legacy encodings apart, because the range of valid byte values is the same, or mostly so, for all the windows-* encodings. For that you would need more advanced heuristics, probably involving a spell-checker.
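The check-then-convert idea above could be sketched as a small shell function, for example under Cygwin. This is only a sketch under a few assumptions: `convert_if_needed` is a made-up helper name, the output goes to `<file>.txt` as in the question's command, and a file that already passes the utf-8 check is simply copied unchanged.

```shell
#!/bin/sh
# Sketch: write a UTF-8 version of one file to "<file>.txt".
# convert_if_needed is a hypothetical helper name, not part of iconv.
convert_if_needed() {
    src=$1
    dst=$1.txt
    # A utf-8 -> utf-8 "conversion" only succeeds on valid UTF-8 input.
    if iconv -f utf-8 -t utf-8 < "$src" > /dev/null 2>&1; then
        # Already valid UTF-8 (pure ASCII also ends up here): copy as is.
        cp -- "$src" "$dst"
    else
        # Not valid UTF-8: assume the legacy encoding is windows-1250.
        iconv -f windows-1250 -t utf-8 < "$src" > "$dst"
    fi
}
```

It would then be called once per file, for instance `for f in /cygdrive/c/utf_check/*.srt; do convert_if_needed "$f"; done` (the `/cygdrive/c` prefix assumes Cygwin's default drive mapping). The same two-step logic can be written in a batch file by testing errorlevel after the first iconv call.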