What is the problem with my Python3 script's parsing of non-ASCII UTF-8 arguments under Linux?

Consider this trivial Python3 script, which simply prints out the contents of sys.argv, both as-is and encoded into UTF-8:

import locale
import sys

print("filesystem encoding is:      ", sys.getfilesystemencoding())
print("local preferred encoding is: ", locale.getpreferredencoding())

print("sys.argv is:")
print(sys.argv)

for a in sys.argv:
   print("Next arg is: ", a)
   print("UTF-8 encoding of arg is: ", a.encode())

If I run this script (via Python 3.11.4) on my Mac (OSX/Ventura 13.3.1/Intel), with an argument that includes a non-ASCII UTF-8 character, I get the expected results:

$ python /tmp/supportfiles/test.py Joe’s
filesystem encoding is:       utf-8
local preferred encoding is:  UTF-8
sys.argv is:
['./test.py', 'Joe’s']
Next arg is:  ./test.py
UTF-8 encoding of arg is:  b'./test.py'
Next arg is:  Joe’s
UTF-8 encoding of arg is:  b'Joe\xe2\x80\x99s'

However, if I run the same command with the same arguments under Linux (Ubuntu 3.19.0, Linux, Python 3.7.0), things go wrong and the script throws a UnicodeEncodeError exception:

filesystem encoding is:       utf-8         
local preferred encoding is:  UTF-8         
sys.argv is:        
['./test.py', 'Joe\udce2\udc80\udc99s']
Next arg is:  ./test.py
UTF-8 encoding of arg is:  b'./test.py'
Next arg is:  Joe’s         
Traceback (most recent call last):          
  File "./test.py", line 13, in <module>
    print("UTF-8 encoding of arg is: ", a.encode())         
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 3-5: surrogates not allowed

My question is, is this a bug in Python or in my Linux box's localization environment, or am I doing something wrong?

And, is there anything I can to do get my script to correctly handle command line arguments containing non-ASCII characters on all OS's?

Solution

It appears that the answer to my question is basically that my embedded Linux environment doesn't handle internationalization properly.

For posterity's sake, below is the workaround code I inserted into my Python script to "repair" the malformed sys.argv strings into proper UTF-8 that the Python encoder will accept:

def IsSurrogateMarker(ordVal):
   highByte = (ordVal & 0xFF00) >> 8
   return ((highByte >= 0xD8) and (highByte <= 0xDF))

# Nasty hack work-around for embedded Linux OS
# mis-encoding UTF-8 argv strings with surrogate bytes
def FixBuggyString(s):
   pendingBytes = []
   r = ''
   for c in s:
      ordVal = ord(c)
      if (IsSurrogateMarker(ordVal)):
         pendingBytes.append(ordVal & 0xFF)  # strip the bogus surrogate-marker
      elif (len(pendingBytes) > 0):
         r += bytearray(pendingBytes).decode()
         pendingBytes = []
         r += c
      else:
         r += c
   if (len(pendingBytes) > 0):
      r = r + bytearray(pendingBytes).decode()
   return r

if __name__ == "__main__":
    # Work-around for buggy UTF-8 in argv strings 
    for i in range(0, len(sys.argv)):
       sys.argv[i] = self.FixBuggyString(sys.argv[i])

    [... remainder of program goes here...]