Search code examples
pythonlinuxmacosunicodeutf-8

What is the problem with my Python3 script's parsing of non-ASCII UTF-8 arguments under Linux?


Consider this trivial Python3 script, which simply prints out the contents of sys.argv, both as-is and encoded into UTF-8:

import locale
import sys

print("filesystem encoding is:      ", sys.getfilesystemencoding())
print("local preferred encoding is: ", locale.getpreferredencoding())

print("sys.argv is:")
print(sys.argv)

for a in sys.argv:
   print("Next arg is: ", a)
   print("UTF-8 encoding of arg is: ", a.encode())

If I run this script (via Python 3.11.4) on my Mac (OSX/Ventura 13.3.1/Intel), with an argument that includes a non-ASCII UTF-8 character, I get the expected results:

$ python /tmp/supportfiles/test.py Joe’s
filesystem encoding is:       utf-8
local preferred encoding is:  UTF-8
sys.argv is:
['./test.py', 'Joe’s']
Next arg is:  ./test.py
UTF-8 encoding of arg is:  b'./test.py'
Next arg is:  Joe’s
UTF-8 encoding of arg is:  b'Joe\xe2\x80\x99s'

However, if I run the same command with the same arguments under Linux (Ubuntu 3.19.0, Linux, Python 3.7.0), things go wrong and the script throws a UnicodeEncodeError exception:

filesystem encoding is:       utf-8         
local preferred encoding is:  UTF-8         
sys.argv is:        
['./test.py', 'Joe\udce2\udc80\udc99s']
Next arg is:  ./test.py
UTF-8 encoding of arg is:  b'./test.py'
Next arg is:  Joe’s         
Traceback (most recent call last):          
  File "./test.py", line 13, in <module>
    print("UTF-8 encoding of arg is: ", a.encode())         
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 3-5: surrogates not allowed       

My question is, is this a bug in Python or in my Linux box's localization environment, or am I doing something wrong?

And, is there anything I can to do get my script to correctly handle command line arguments containing non-ASCII characters on all OS's?


Solution

  • It appears that the answer to my question is basically that my embedded Linux environment doesn't handle internationalization properly.

    For posterity's sake, below is the workaround code I inserted into my Python script to "repair" the malformed sys.argv strings into proper UTF-8 that the Python encoder will accept:

    def IsSurrogateMarker(ordVal):
       highByte = (ordVal & 0xFF00) >> 8
       return ((highByte >= 0xD8) and (highByte <= 0xDF))
    
    # Nasty hack work-around for embedded Linux OS
    # mis-encoding UTF-8 argv strings with surrogate bytes
    def FixBuggyString(s):
       pendingBytes = []
       r = ''
       for c in s:
          ordVal = ord(c)
          if (IsSurrogateMarker(ordVal)):
             pendingBytes.append(ordVal & 0xFF)  # strip the bogus surrogate-marker
          elif (len(pendingBytes) > 0):
             r += bytearray(pendingBytes).decode()
             pendingBytes = []
             r += c
          else:
             r += c
       if (len(pendingBytes) > 0):
          r = r + bytearray(pendingBytes).decode()
       return r
    
    if __name__ == "__main__":
        # Work-around for buggy UTF-8 in argv strings 
        for i in range(0, len(sys.argv)):
           sys.argv[i] = self.FixBuggyString(sys.argv[i])
    
        [... remainder of program goes here...]