Tags: python, unicode, text-files, encode, emoji

Best and cleanest way to encode emojis (Python) from a text file


Related to this question: Emoji crashed when uploading to Big Query

I'm looking for the cleanest way to convert emojis from the surrogate-pair escape form \ud83d\ude04 to the single-code-point form (Unicode) \U0001f604. Currently, my only idea is to write a Python function that goes through a text file and replaces each emoji escape.

This is how a single string can be converted:

Converting emojis to Unicode and vice versa in python 3

My assumption is that I need to go through the text line by line and convert it?

Potential Idea:

with open(ff_name, 'rb') as source_file:
  with open(target_file_name, 'w+b') as dest_file:
    contents = source_file.read()
    dest_file.write(contents.decode('utf-16').encode('utf-8'))

Solution

  • I'll assume that you somehow have a raw ASCII string containing escape sequences for UTF-16 code units that form surrogate pairs, and that you (for whatever reason) want to convert it to the \UXXXXXXXX format.

    From here on, I assume that your input (bytes!) looks like this:

    weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
    

    Now you want to do the following:

    1. Interpret the bytes so that each \uXXXX escape becomes a UTF-16 code unit. The raw_unicode_escape codec does this, but it decodes each escape on its own, so a surrogate pair comes out as two lone surrogates and a separate pass is needed to join them.
    2. Fix the surrogate pairs by encoding to UTF-16 with the surrogatepass error handler, which turns the data into valid UTF-16.
    3. Decode the valid UTF-16.
    4. Encode again with "raw_unicode_escape", which writes characters above U+FFFF as \UXXXXXXXX.
    5. Decode back as good old latin_1, which leaves only good old ASCII with Unicode escape sequences in the format \UXXXXXXXX.

    Something like this:

      output = (weirdInput
        .decode("raw_unicode_escape")
        .encode('utf-16', 'surrogatepass')
        .decode('utf-16')
        .encode("raw_unicode_escape")
        .decode("latin_1")
      )
    

    Now if you print(output), you get:

    hello \U0001f604
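
To make the role of each step more visible, here is a rough sketch that unrolls the same chain and prints the intermediate values. The names weird_input and stage1 to stage5 are only illustrative and are not part of the code above.

```python
# Unrolled version of the chain, printing what each stage produces.
weird_input = "hello \\ud83d\\ude04".encode("latin_1")

# Stage 1: every \uXXXX escape becomes one code point, so the pair
# arrives as two separate (lone) surrogates instead of one emoji.
stage1 = weird_input.decode("raw_unicode_escape")
print(len(stage1), ascii(stage1))   # 8 'hello \ud83d\ude04'

# Stage 2: 'surrogatepass' lets the lone surrogates through as UTF-16 code units.
stage2 = stage1.encode("utf-16", "surrogatepass")

# Stage 3: decoding UTF-16 joins the two code units into U+1F604.
stage3 = stage2.decode("utf-16")
print(len(stage3), ascii(stage3))   # 7 'hello \U0001f604'

# Stages 4 and 5: back to ASCII text with a \UXXXXXXXX escape.
stage5 = stage3.encode("raw_unicode_escape").decode("latin_1")
print(stage5)                       # hello \U0001f604
```

The length dropping from 8 to 7 is the sign that the two surrogates were merged into a single character.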
    

    Note that if you stop at an intermediate stage:

    smiley = (weirdInput
      .decode("raw_unicode_escape")
      .encode('utf-16', 'surrogatepass')
      .decode('utf-16')
    )
    

    then you get a Unicode string containing the actual smiley:

    print(smiley)
    # hello 😄
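
For reference, the pairing that the UTF-16 round trip performs can also be done by hand. The following is a different technique from the codec chain above: a regex sketch that finds a high surrogate escape followed by a low surrogate escape and computes the code point arithmetically. PAIR and combine are hypothetical names, and the sketch assumes the escapes always use four hex digits.

```python
import re

# A high surrogate escape (D800-DBFF) followed by a low surrogate escape (DC00-DFFF).
PAIR = re.compile(r"\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})", re.IGNORECASE)

def combine(match):
    """Turn one matched surrogate-pair escape into a \\UXXXXXXXX escape."""
    high = int(match.group(1), 16)
    low = int(match.group(2), 16)
    # Standard UTF-16 decoding formula for a surrogate pair.
    code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return "\\U%08x" % code_point

text = "hello \\ud83d\\ude04"
converted = PAIR.sub(combine, text)
print(converted)
# hello \U0001f604
```

Unlike the codec chain, this only touches well-formed pairs and leaves every other character of the string alone.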
    

    Full code:

    weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
    
    output = (weirdInput
      .decode("raw_unicode_escape")
      .encode('utf-16', 'surrogatepass')
      .decode('utf-16')
      .encode("raw_unicode_escape")
      .decode("latin_1")
    )
    
    
    smiley = (weirdInput
      .decode("raw_unicode_escape")
      .encode('utf-16', 'surrogatepass')
      .decode('utf-16')
    )
    
    print(output)
    # hello \U0001f604
    
    print(smiley)
    # hello 😄
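
Since the question is about a text file, here is a minimal sketch of how the same chain might be applied line by line. It assumes the file is pure ASCII, as above; convert_escapes, convert_file and the path parameters are hypothetical names chosen for illustration.

```python
def convert_escapes(text):
    """Rewrite surrogate-pair escapes in an ASCII string as \\UXXXXXXXX escapes."""
    return (text
        .encode("latin_1")
        .decode("raw_unicode_escape")
        .encode("utf-16", "surrogatepass")
        .decode("utf-16")
        .encode("raw_unicode_escape")
        .decode("latin_1")
    )

def convert_file(source_path, target_path):
    """Copy source_path to target_path, converting the escapes on every line."""
    # Assumption: the file contains only ASCII; anything else raises an error here.
    with open(source_path, "r", encoding="ascii") as source_file, \
         open(target_path, "w", encoding="ascii") as dest_file:
        for line in source_file:
            dest_file.write(convert_escapes(line))

# Example call (file names are placeholders):
# convert_file("input.txt", "output.txt")
```

Working line by line is safe here because an escape sequence never contains a line break, so a pair cannot be split across two lines. An unpaired surrogate escape in the input would probably make the UTF-16 decode step fail, so such input needs separate handling.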