python, utf-16

How to add '\u' to a string and convert it to UTF-16 in Python?


I'm trying to read UTF-16 text from a GSM module (SIM800L). It gives me:

0633064406270645 06280647 0647064506af06cc

Rather than :

\u0633\u0644\u0627\u0645 \u0628\u0647 \u0647\u0645\u06af\u06cc

I tried many ways to add '\u' to the first string, and even to convert it to bytes, but every time Python treats the result as literal ASCII characters.

For example:

> Str = r'\u' + Str
Result: \\u0633064406270645 06280647 0647064506af06cc

Because of the double backslash, Python doesn't treat it as a \u escape sequence.

I am looking for any method to convert the GSM module's output to Unicode.


Solution

  • Using a combination of modules

    • re (replace each group of four hexadecimal digits with the corresponding character), and
    • json (handle surrogate pairs correctly).

    Note: I added a valid surrogate pair (D83DDE0E) and a noncharacter (FFFE) to the hard-coded string, purely for debugging purposes:

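    import re
    import json

    def repl_unicode(matchobj):
        # Convert one 4-digit hex group into its UTF-16 code unit.
        return chr(int(matchobj.group(0), 16))

    text_16 = '0633064406270645 06280647 0647064506af06cc D83DDE0E FFFE'
    pattern = '[0-9A-Fa-f]{4}'  # exactly four hexadecimal digits
    # The json round trip merges surrogate pairs into single characters.
    text_u8 = json.loads(json.dumps(re.sub(pattern, repl_unicode, text_16)))
    print(text_16)
    print(text_u8)
    # Show the escaped form, one \uXXXX per UTF-16 code unit.
    print(json.dumps(text_u8, ensure_ascii=True).strip('"'))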
    

    Output: .\SO\78628940.py

    0633064406270645 06280647 0647064506af06cc D83DDE0E FFFE
    سلام به همگی 😎 
    \u0633\u0644\u0627\u0645 \u0628\u0647 \u0647\u0645\u06af\u06cc \ud83d\ude0e \ufffe
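
As an alternative, here is a minimal sketch that skips the regular expression. It assumes the module's reply is big-endian UTF-16 written as hex digits, with spaces separating words as in the sample above. Each space-separated chunk is decoded with bytes.fromhex and the utf-16-be codec, which also merges surrogate pairs. The function name decode_utf16_hex is hypothetical, chosen only for illustration.

```python
def decode_utf16_hex(hex_text: str) -> str:
    """Decode space-separated runs of UTF-16BE hex digits into a str.

    Assumption: every chunk holds whole 4-digit code units, big-endian.
    """
    words = []
    for chunk in hex_text.split(' '):
        # bytes.fromhex accepts upper- and lower-case hex digits.
        raw = bytes.fromhex(chunk)
        # 'utf-16-be' expects no BOM and combines surrogate pairs.
        words.append(raw.decode('utf-16-be'))
    return ' '.join(words)


text_16 = '0633064406270645 06280647 0647064506af06cc D83DDE0E'
print(decode_utf16_hex(text_16))
```

Unlike the re/json version, this sketch raises an exception on malformed input (an odd number of hex digits, a truncated code unit, or a lone surrogate) instead of passing it through, which may or may not be what you want when reading from a serial device.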