We have a file that displays correctly in ordinary editors such as Notepad++: the emoji is rendered and no extra new lines are added.
The problem we are facing is that, when we open the same file with Python, the UTF-16 bytes of the emoji are split across two lines, which breaks our Big Data processing framework that reads the file in parallel.
We need to understand what makes it clear to Notepad++ that there is not a real new line in the sequence =\xd8\n\xde so that we can adjust our custom file reader.
STEPS TO REPRODUCE
Copy this emoji 😊 to an empty file and add a new line.
Save the file and open it with Python in bytes format:
# Open the file as bytes:
with open("file_name.csv", "rb") as f:
    for line in f:
        print(line)
You will find that the emoji is split in the middle, as if it contained a newline character:
b'\xff\xfe=\xd8\n'
b'\xde\r\x00\n'
b'\x00'
The UTF-8 bytes of U+1F60A actually are (hexadecimal) f0 9f 98 8a. Note that this does not contain the byte 0A (aka \n) but 8A.
The UTF-16BE (big endian) two-byte code units are: d83d de0a.
The UTF-16LE (little endian) two-byte code units are: 3dd8 0ade.
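These byte sequences can be checked directly in Python. The following is only a small sketch; the variable names are made up for illustration, and the two-argument form of bytes.hex() assumes Python 3.8 or newer.

```python
emoji = "\U0001F60A"  # U+1F60A, the emoji from the question

utf8 = emoji.encode("utf-8")
utf16_be = emoji.encode("utf-16-be")
utf16_le = emoji.encode("utf-16-le")

print(utf8.hex(" "))         # f0 9f 98 8a
print(utf16_be.hex(" ", 2))  # d83d de0a
print(utf16_le.hex(" ", 2))  # 3dd8 0ade

# The byte 0x0A occurs in the UTF-16 forms but not in the UTF-8 form.
print(b"\n" in utf8)      # False
print(b"\n" in utf16_le)  # True
```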
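And here is the error: there is a byte 0a, but it is the low byte of the code unit de0a, not a newline. In UTF-16LE a real newline is the two-byte code unit 0a 00. You are reading the file as raw bytes (or with a single-byte encoding), which splits on every 0a byte, so it doesn't handle the 0a inside the emoji correctly.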
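As a rough sketch of two possible fixes, assuming the file really is UTF-16LE with a BOM as the bytes in the question suggest: either let Python decode the file in text mode, or, if the reader must stay at the byte level, split only on the code unit 0a 00 at even offsets. The helper split_utf16le_lines is a hypothetical name, not part of any library, and the temporary file only stands in for the real one.

```python
import os
import tempfile

# Bytes from the question: BOM, emoji, CR LF, all encoded as UTF-16LE.
raw = b"\xff\xfe=\xd8\n\xde\r\x00\n\x00"

def split_utf16le_lines(data):
    """Split on the code unit 0a 00, looking only at even (aligned) offsets."""
    lines, start = [], 0
    for i in range(0, len(data) - 1, 2):
        if data[i:i + 2] == b"\n\x00":
            lines.append(data[start:i + 2])
            start = i + 2
    if start < len(data):
        lines.append(data[start:])  # trailing data without a final newline
    return lines

# Write the sample bytes to a temporary file (stand-in for file_name.csv).
path = os.path.join(tempfile.mkdtemp(), "file_name.csv")
with open(path, "wb") as f:
    f.write(raw)

# Option 1: text mode with the right encoding; the BOM is consumed by the codec.
with open(path, encoding="utf-16", newline="") as f:
    text_lines = list(f)

# Option 2: byte-level splitting that respects two-byte code units.
byte_lines = split_utf16le_lines(raw)

print(len(text_lines), len(byte_lines))  # 1 1
```

For a reader that processes the file in parallel, the same idea would presumably apply to chunk boundaries: each chunk has to start at an even byte offset, otherwise the code units are misaligned and a 0a 00 match no longer means a newline.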