
Convert raw byte string to Unicode without knowing the codepage beforehand


When a script is launched from the right-click context menu, Windows passes the file path as a raw (byte) string.

For example:

path = 'C:\\MyDir\\\x99\x8c\x85\x8d.mp3'

Many external packages in my application expect unicode strings, so I have to convert the path to unicode.

That would be easy if I knew the raw string's encoding beforehand (in the example, it is cp1255). However, I can't know which encoding will be used locally on each computer around the world.
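
For reference, the known-encoding case might look like the sketch below. It assumes the example bytes are in the DOS code page 862 that the answer below mentions, not cp1255; the variable names are illustrative only.

```python
# Raw byte string as received (the b'' prefix keeps this valid on Python 2.6+ and 3).
raw_path = b'C:\\MyDir\\\x99\x8c\x85\x8d.mp3'

# Assumption: the bytes use the DOS/OEM Hebrew code page (cp862).
# A different machine may use a different code page, which is the
# problem the question is about.
unicode_path = raw_path.decode('cp862')

print(repr(unicode_path))
```

This only works when the code page is known and can represent every character in the file name, so it is not a general solution.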

How can I convert the string into unicode? Perhaps using win32api is needed?


Solution

  • No idea why you might be getting the DOS code page (862) instead of ANSI (1255) - how is the right-click option set up?

    Either way - if you need to accept any arbitrary Unicode character in your arguments you can't do it from Python 2's sys.argv. This list is populated from the bytes returned by the non-Unicode version of the Win32 API (GetCommandLineA), and that encoding is never Unicode-safe.

    Many other languages including Java and Ruby are in the same boat; the limitation comes from the Microsoft C runtime's implementations of the C standard library functions. To fix it, one would call the Unicode version (GetCommandLineW) on Windows instead of relying on the cross-platform standard library. Python 3 does this.

    In the meantime, for Python 2 you can do it by calling GetCommandLineW yourself, but it's not especially pretty. You can also use CommandLineToArgvW if you want Windows-style parameter splitting. Either can be done with the win32 extensions or with plain ctypes.

    Example (though the step of encoding the Unicode string back to UTF-8 bytes is best skipped).
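
A minimal ctypes sketch of the GetCommandLineW plus CommandLineToArgvW approach described above (not the linked example). The helper name unicode_argv is hypothetical, and the non-Windows fallback is an addition so that the snippet can be imported anywhere. It returns unicode strings and does not re-encode them to UTF-8.

```python
import sys


def unicode_argv():
    """Return the command-line arguments as unicode strings (hypothetical helper)."""
    if sys.platform != 'win32':
        # Assumption: off Windows there is nothing to fix, so return argv as-is.
        return list(sys.argv)

    import ctypes
    from ctypes import wintypes

    GetCommandLineW = ctypes.windll.kernel32.GetCommandLineW
    GetCommandLineW.argtypes = []
    GetCommandLineW.restype = wintypes.LPCWSTR

    CommandLineToArgvW = ctypes.windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.argtypes = [wintypes.LPCWSTR,
                                   ctypes.POINTER(ctypes.c_int)]
    CommandLineToArgvW.restype = ctypes.POINTER(wintypes.LPWSTR)

    LocalFree = ctypes.windll.kernel32.LocalFree
    LocalFree.argtypes = [wintypes.HLOCAL]
    LocalFree.restype = wintypes.HLOCAL

    argc = ctypes.c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), ctypes.byref(argc))
    if not argv:
        raise ctypes.WinError()
    try:
        # Copy the wide strings out before releasing the buffer.
        args = [argv[i] for i in range(argc.value)]
    finally:
        LocalFree(argv)

    # The full command line also contains the interpreter and its options
    # (for example "python.exe -u script.py file.mp3"); keep only the tail
    # that lines up with sys.argv.
    return args[len(args) - len(sys.argv):]


if __name__ == '__main__':
    for arg in unicode_argv():
        print(repr(arg))
```

On Python 2 under Windows, the returned list can be used in place of sys.argv wherever a unicode path is needed.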