When I open a certain file in Visual Studio Code on Windows 10, then edit and save it, VS Code seems to replace certain characters with other characters, so that some text in the saved file looks corrupted, as shown below. The default character encoding used in VS Code is UTF-8.
Non-corrupted string before saving the file:
“Diff Clang Compiler Log Files”
Corrupted string after saving the file:
�Diff Clang Compiler Log Files�
So, for example, the opening double quotation mark “, which in the original file is represented by the byte string 0xE2 0x80 0x9C, is converted into 0xEF 0xBF 0xBD (the UTF-8 encoding of the replacement character U+FFFD) upon saving the file. I do not fully understand the root cause, but I do have the following assumption:
Is my understanding correct?
Can I somehow detect (through PowerShell or Python code) whether a file uses Windows-1252 or UTF-8 encoding? Or is there no definite way to determine that? I would be really glad to find a way to avoid corrupting my files in the future :-).
Thank you!
The encoding of the file can be detected with the help of the python-magic module (imported as magic):
import magic

FILE_PATH = 'C:\\myPath'

def getFileEncoding(filePath):
    # Read the raw bytes so that no decoding happens before detection.
    with open(filePath, 'rb') as f:
        blob = f.read()
    # mime_encoding=True makes libmagic report the encoding, not the file type.
    m = magic.Magic(mime_encoding=True)
    return m.from_buffer(blob)

fileEncoding = getFileEncoding(FILE_PATH)
print(f"File Encoding: {fileEncoding}")
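If installing a third-party package is not an option, a rough check is possible with the standard library alone. The following is a minimal sketch, assuming the file is either UTF-8 or Windows-1252. Bytes that decode cleanly under strict UTF-8 are very likely UTF-8, because Windows-1252 text containing non-ASCII characters rarely happens to form valid UTF-8 sequences. The function name guess_encoding is hypothetical, and the result is only a guess: there is no definite way to tell, and a pure-ASCII file is valid in both encodings.

```python
def guess_encoding(data: bytes) -> str:
    """Heuristically classify raw bytes as 'ascii', 'utf-8' or 'cp1252'."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8. Assuming Windows-1252 here is a guess, not a proof.
        return "cp1252"
    # Pure ASCII is byte-for-byte identical in both encodings.
    return "ascii" if text.isascii() else "utf-8"


# The quoted string from the question, encoded both ways.
sample = "\u201cDiff Clang Compiler Log Files\u201d"
print(guess_encoding(sample.encode("utf-8")))   # utf-8
print(guess_encoding(sample.encode("cp1252")))  # cp1252

# A lone Windows-1252 quote byte (0x93) is invalid UTF-8. Decoding it with
# replacement yields U+FFFD, which is then written out as EF BF BD.
print(b"\x93".decode("utf-8", errors="replace").encode("utf-8").hex())  # efbfbd
```

To check a real file, read it in binary mode (open(path, 'rb')) and pass the bytes to the function. If the answer is cp1252, reopening the file in the editor with that encoding before saving should avoid the corruption.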