This refers to the question: Emoji crashed when uploading to Big Query
I'm looking for the best and cleanest way to convert emojis from this \ud83d\ude04
type to this one (Unicode) - \U0001f604
because currently I have no idea other than writing a Python method that passes through a text file and replaces the emoji encoding.
This question shows how such a string can be converted:
Converting emojis to Unicode and vice versa in python 3
My assumption is that I may need to pass through the text line by line and convert it?
Potential idea:
with open(ff_name, 'rb') as source_file:
    with open(target_file_name, 'w+b') as dest_file:
        contents = source_file.read()
        dest_file.write(contents.decode('utf-16').encode('utf-8'))
So, I'll assume that you somehow get a raw ASCII string that contains escape sequences with UTF-16 code units that form surrogate pairs, and that you (for whatever reason) want to convert it to the \UXXXXXXXX format.
So, henceforth I assume that your input (bytes!) looks like this:
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
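Now you want to do the following:
1. Interpret the input in such a way that the \uXXXX thingies are transformed into UTF-16 code units. There is the raw_unicode_escape codec, but unfortunately it needs a separate pass to fix the surrogate pairs (I don't know why, to be honest).
2. Fix the surrogate pairs by encoding to UTF-16 with the surrogatepass error handler and decoding the result again.
3. Convert back to latin_1, consisting only of good old ASCII with Unicode escape sequences in the format \UXXXXXXXX.
Something like this: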
output = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
    .encode("raw_unicode_escape")
    .decode("latin_1")
)
Now if you print(output), you get:
hello \U0001f604
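To see why the extra UTF-16 round trip is there, it can help to look at the stages one at a time. The sketch below is only an illustration under the same assumption about the input; the names weird_input, stage1 and stage3 are made up for this example. A Python str is a sequence of code points, and raw_unicode_escape turns each \uXXXX escape into exactly one code point, so the emoji first shows up as two lone surrogates.

```python
# Same assumed input as in the answer: ASCII bytes with \uXXXX escapes.
weird_input = "hello \\ud83d\\ude04".encode("latin_1")

# Stage 1: every \uXXXX escape becomes one code point, so the emoji
# arrives as two separate surrogate code points.
stage1 = weird_input.decode("raw_unicode_escape")
print(len(stage1))                         # 8
print([hex(ord(c)) for c in stage1[6:]])   # ['0xd83d', '0xde04']

# Lone surrogates are rejected by the default (strict) error handler.
strict_failed = False
try:
    stage1.encode("utf-8")
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)                       # True

# Stages 2 and 3: 'surrogatepass' writes the surrogates out as UTF-16
# code units, and decoding that UTF-16 joins the pair into one character.
stage3 = stage1.encode("utf-16", "surrogatepass").decode("utf-16")
print(len(stage3))                         # 7
print(hex(ord(stage3[6])))                 # 0x1f604
```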
Note that if you stop at an intermediate stage:
smiley = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
)
then you get a Unicode-string with smileys:
print(smiley)
# hello 😄
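If the real goal is to hand the text to something that expects UTF-8 rather than escape sequences (which is my guess for the upload case, not something stated in the question), the intermediate string can be encoded directly. A small sketch, with weird_input and utf8_bytes as illustrative names:

```python
# Same assumed input as above.
weird_input = "hello \\ud83d\\ude04".encode("latin_1")

# Stop after the surrogate pair has been joined into one character.
smiley = (weird_input
    .decode("raw_unicode_escape")
    .encode("utf-16", "surrogatepass")
    .decode("utf-16")
)

# The repaired string no longer contains lone surrogates,
# so a strict UTF-8 encode succeeds.
utf8_bytes = smiley.encode("utf-8")
print(utf8_bytes)
# b'hello \xf0\x9f\x98\x84'
```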
Full code:
weirdInput = "hello \\ud83d\\ude04".encode("latin_1")
output = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
    .encode("raw_unicode_escape")
    .decode("latin_1")
)
smiley = (weirdInput
    .decode("raw_unicode_escape")
    .encode('utf-16', 'surrogatepass')
    .decode('utf-16')
)
print(output)
# hello \U0001f604
print(smiley)
# hello 😄
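Since the question mentions passing through a text file line by line, the same chain can be wrapped in a helper and applied to each line. This is a minimal sketch, not a tested production tool: fix_surrogate_escapes and convert_file are hypothetical names, the file names are placeholders, and it assumes the file really is ASCII with \uXXXX escapes and that no escape pair is split across two lines.

```python
import os
import tempfile


def fix_surrogate_escapes(raw: bytes) -> bytes:
    # Rewrite escaped surrogate pairs as single 8-digit escapes,
    # using the same codec chain as in the answer above.
    return (raw
        .decode("raw_unicode_escape")
        .encode("utf-16", "surrogatepass")
        .decode("utf-16")
        .encode("raw_unicode_escape")
    )


def convert_file(source_path: str, target_path: str) -> None:
    # Binary mode keeps the bytes untouched; iterating yields one line at a time.
    with open(source_path, "rb") as source_file, open(target_path, "wb") as dest_file:
        for line in source_file:
            dest_file.write(fix_surrogate_escapes(line))


# Small demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp_dir:
    source_path = os.path.join(tmp_dir, "in.txt")
    target_path = os.path.join(tmp_dir, "out.txt")
    with open(source_path, "wb") as f:
        f.write(b"hello \\ud83d\\ude04\nplain line\n")
    convert_file(source_path, target_path)
    with open(target_path, "rb") as f:
        result = f.read()

print(result)
# b'hello \\U0001f604\nplain line\n'
```

Lines without any escapes pass through unchanged, so the helper can be run over the whole file without checking first which lines contain emojis.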