Windows-1252 to UTF-8 encoding


I've copied certain files from a Windows machine to a Linux machine.
All the files encoded with Windows-1252 need to be converted to UTF-8.
The files which are already in UTF-8 should not be changed.

I'm planning to use the recode utility for that. How can I tell recode to convert only the Windows-1252 encoded files and leave the UTF-8 files alone?

Example usage of recode:

recode windows-1252..utf-8 myfile.txt

This would convert myfile.txt from Windows-1252 to UTF-8. Before doing this, I would like to confirm that myfile.txt is actually Windows-1252 encoded and not already UTF-8 encoded.
Otherwise, I believe the conversion would corrupt the file.


Solution

  • How would you expect recode to know that a file is Windows-1252? In theory, I believe almost any file is a valid Windows-1252 file, as the encoding maps nearly every possible byte to a character (only 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned).

    Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.

    One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.

    I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding. If you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences), it may well convert the invalid sequences into question marks or something similar. In that case you could check whether a file is valid UTF-8 by recoding it from UTF-8 to UTF-8 and seeing whether the input and output are identical.

    Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.

    Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.
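
As a rough illustration of the programmatic route (in Python rather than C#, which is purely a choice made for this sketch), the "is it valid UTF-8?" check could look something like the following. The function names looks_like_utf8 and convert_if_needed are made up for the example. It assumes that anything which is not valid UTF-8 is Windows-1252, which is exactly the heuristic described above and not a guarantee.

```python
from pathlib import Path


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (a heuristic only)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def convert_if_needed(path: Path) -> bool:
    """Rewrite the file as UTF-8 unless it already is valid UTF-8.

    Returns True if the file was converted, False if it was left untouched.
    """
    data = path.read_bytes()
    if looks_like_utf8(data):
        # Valid UTF-8 (this includes pure ASCII): leave the file alone.
        return False
    # Assumption: a file that is not valid UTF-8 is Windows-1252.
    # The five bytes Windows-1252 leaves unassigned become U+FFFD.
    text = data.decode("cp1252", errors="replace")
    path.write_bytes(text.encode("utf-8"))
    return True
```

Pure ASCII files pass the UTF-8 check and are skipped, which is harmless because ASCII bytes mean the same thing in both encodings. Since the whole approach is a guess, it would be sensible to run it on copies of the files first.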