I've copied certain files from a Windows machine to a Linux machine.
All the files encoded with Windows-1252 need to be converted to
UTF-8.
The files which are already in UTF-8 should not be changed.
I'm planning to use the recode
utility for that.
How can I specify that the recode
utility should only convert
windows-1252 encoded files and not the UTF-8 files?
Example usage of recode:
recode windows-1252.. myfile.txt
This would convert myfile.txt
from windows-1252 to UTF-8.
Before doing this, I would like to know that myfile.txt
is actually
windows-1252 encoded and not UTF-8 encoded.
Otherwise, I believe this would corrupt the file.
How would you expect recode to know that a file is Windows-1252? In theory, I believe any file is a valid Windows-1252 file, as it maps every possible byte to a character.
Now there are certainly characteristics which would strongly suggest that it's UTF-8 - if it starts with the UTF-8 BOM, for example - but they wouldn't be definitive.
One option would be to detect whether it's actually a completely valid UTF-8 file first, I suppose... again, that would only be suggestive.
I'm not familiar with the recode tool itself, but you might want to see whether it's capable of recoding a file from and to the same encoding - if you do this with an invalid file (i.e. one which contains invalid UTF-8 byte sequences) it may well convert the invalid sequences into question marks or something similar. At that point you could detect that a file is valid UTF-8 by recoding it to UTF-8 and seeing whether the input and output are identical.
Alternatively, do this programmatically rather than using the recode utility - it would be quite straightforward in C#, for example.
Just to reiterate though: all of this is heuristic. If you really don't know the encoding of a file, nothing is going to tell you it with 100% accuracy.