Text editor keeps using the wrong file encoding and replaces certain characters with other codes


On Windows 10, when I open a certain file in Visual Studio Code (VSC), edit it, and save it, VSC seems to replace certain characters with other characters, so that some text in the saved file looks corrupted, as shown below. The default character encoding used in VSC is UTF-8.

Non-corrupted string before saving the file:
“Diff Clang Compiler Log Files”

Corrupted string after saving the file:
�Diff Clang Compiler Log Files�

So, for example, the left double quotation mark character “, which in the original file is represented by the byte string 0xE2 0x80 0x9C, is converted into 0xEF 0xBF 0xBD upon saving the file. I do not fully understand the root cause, but I have the following assumption:

  1. The original file is saved using the Windows-1252 encoding (I am using a Windows 10 machine with a German keyboard).
  2. VSC incorrectly interprets the file as UTF-8.
  3. Character codes get converted from Windows-1252 into UTF-8 once the file is saved, thus 0xE2 0x80 0x9C becomes 0xEF 0xBF 0xBD.
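
The byte values involved can be checked with a short Python sketch. It is an illustration, not a diagnosis of the file in question: it assumes the quotation mark is U+201C and that a Windows-1252 byte is decoded as UTF-8 with invalid bytes replaced, which is one way the sequence 0xEF 0xBF 0xBD (U+FFFD, the replacement character) could arise. The variable names are illustrative.

```python
# U+201C LEFT DOUBLE QUOTATION MARK in the two encodings under discussion.
left_quote = "\u201c"
utf8_bytes = left_quote.encode("utf-8")      # three bytes in UTF-8
cp1252_bytes = left_quote.encode("cp1252")   # one byte in Windows-1252

# Assumption: the editor reads the Windows-1252 byte as UTF-8. The byte 0x93
# is not valid UTF-8 on its own, so a lenient decoder substitutes U+FFFD.
misread = cp1252_bytes.decode("utf-8", errors="replace")

# Saving the already-damaged text as UTF-8 writes the bytes of U+FFFD.
saved_bytes = misread.encode("utf-8")

print(utf8_bytes.hex(" "))    # e2 80 9c
print(cp1252_bytes.hex(" "))  # 93
print(saved_bytes.hex(" "))   # ef bf bd
```

Note that 0xE2 0x80 0x9C is already the UTF-8 form of the character, whereas Windows-1252 stores it as the single byte 0x93.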

Is my understanding correct?

Can I somehow detect (through PowerShell or Python code) whether a file uses Windows-1252 or UTF-8 encoding? Or is there no definite way to determine that? I would really be glad to find a way to avoid corrupting my files in the future :-).

Thank you!


Solution

  • The encoding of a file can be estimated with the help of the Python magic module (the python-magic package):

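    import magic  # third-party package "python-magic" (bindings for libmagic)

    FILE_PATH = 'C:\\myPath'

    def getFileEncoding(filePath):
        # Read the raw bytes so that nothing is decoded before detection.
        with open(filePath, 'rb') as f:
            blob = f.read()
        # mime_encoding=True asks libmagic for the charset instead of the file type.
        m = magic.Magic(mime_encoding=True)
        return m.from_buffer(blob)

    fileEncoding = getFileEncoding(FILE_PATH)
    print(f"File Encoding: {fileEncoding}")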
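
If installing libmagic is not an option, a rough check is possible with the standard library alone. The sketch below is a heuristic, not a definitive detector: it assumes the only candidates are UTF-8 and Windows-1252, and the function name guess_encoding is hypothetical. It relies on the fact that text containing non-ASCII Windows-1252 bytes is rarely valid UTF-8, while almost any byte sequence decodes as Windows-1252.

```python
def guess_encoding(raw: bytes) -> str:
    """Return a best-effort encoding label for the raw bytes of a file."""
    # A UTF-8 byte order mark is a strong hint.
    if raw.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    try:
        # Strict decoding fails on byte sequences that are not valid UTF-8.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: Windows-1252 is the only other candidate encoding.
        return "cp1252"
    # Pure ASCII content is identical in both encodings.
    return "ascii" if raw.isascii() else "utf-8"


# Sample data instead of a real file, to keep the sketch self-contained.
sample = "\u201cDiff Clang Compiler Log Files\u201d"
print(guess_encoding(sample.encode("cp1252")))
print(guess_encoding(sample.encode("utf-8")))
```

For a real file, the bytes can be read with open(path, 'rb') and passed to the function. A result of "ascii" means the file contains no characters that could be corrupted by confusing the two encodings.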