Consider this trivial Python3 script, which simply prints out the contents of sys.argv
, both as-is and encoded into UTF-8:
import locale
import sys
print("filesystem encoding is: ", sys.getfilesystemencoding())
print("local preferred encoding is: ", locale.getpreferredencoding())
print("sys.argv is:")
print(sys.argv)
for a in sys.argv:
print("Next arg is: ", a)
print("UTF-8 encoding of arg is: ", a.encode())
If I run this script (via Python 3.11.4) on my Mac (OSX/Ventura 13.3.1/Intel), with an argument that includes a non-ASCII UTF-8 character, I get the expected results:
$ python /tmp/supportfiles/test.py Joe’s
filesystem encoding is: utf-8
local preferred encoding is: UTF-8
sys.argv is:
['./test.py', 'Joe’s']
Next arg is: ./test.py
UTF-8 encoding of arg is: b'./test.py'
Next arg is: Joe’s
UTF-8 encoding of arg is: b'Joe\xe2\x80\x99s'
However, if I run the same command with the same arguments under Linux (Ubuntu 3.19.0, Linux, Python 3.7.0), things go wrong and the script throws a UnicodeEncodeError
exception:
filesystem encoding is: utf-8
local preferred encoding is: UTF-8
sys.argv is:
['./test.py', 'Joe\udce2\udc80\udc99s']
Next arg is: ./test.py
UTF-8 encoding of arg is: b'./test.py'
Next arg is: Joe’s
Traceback (most recent call last):
File "./test.py", line 13, in <module>
print("UTF-8 encoding of arg is: ", a.encode())
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 3-5: surrogates not allowed
My question is, is this a bug in Python or in my Linux box's localization environment, or am I doing something wrong?
And, is there anything I can to do get my script to correctly handle command line arguments containing non-ASCII characters on all OS's?
It appears that the answer to my question is basically that my embedded Linux environment doesn't handle internationalization properly.
For posterity's sake, below is the workaround code I inserted into my Python script to "repair" the malformed sys.argv
strings into proper UTF-8 that the Python encoder will accept:
def IsSurrogateMarker(ordVal):
highByte = (ordVal & 0xFF00) >> 8
return ((highByte >= 0xD8) and (highByte <= 0xDF))
# Nasty hack work-around for embedded Linux OS
# mis-encoding UTF-8 argv strings with surrogate bytes
def FixBuggyString(s):
pendingBytes = []
r = ''
for c in s:
ordVal = ord(c)
if (IsSurrogateMarker(ordVal)):
pendingBytes.append(ordVal & 0xFF) # strip the bogus surrogate-marker
elif (len(pendingBytes) > 0):
r += bytearray(pendingBytes).decode()
pendingBytes = []
r += c
else:
r += c
if (len(pendingBytes) > 0):
r = r + bytearray(pendingBytes).decode()
return r
if __name__ == "__main__":
# Work-around for buggy UTF-8 in argv strings
for i in range(0, len(sys.argv)):
sys.argv[i] = self.FixBuggyString(sys.argv[i])
[... remainder of program goes here...]