How to convert UTF-8 bytes of embedded emoji into a string


I am scraping data from a WhatsApp chat backup (chat.txt). It looks like this:

7/21/20, 1:31 PM - mark: Can we look google😂😂  
7/21/20, 1:31 PM - elon: No  
7/21/20, 1:31 PM - mark: Can we smile ?  
7/21/20, 1:31 PM - elon: Ya🤩

When I read the file line by line:

with open ('chat.txt','rb') as file:
    for line in file:
        print(str(line.strip()))

I got this:

b'7/21/20, 7:37 AM - mark: Can we look google\xf0\x9f\xa4\xa9\xf0\x9f\x98\x82\xf0\x9f\x98\x82'
b'7/21/20, 7:37 AM - elon: No'
b'7/21/20, 1:31 PM - mark: Can we smile ?'
b'7/21/20, 7:37 AM - elon: Ya\xf0\x9f\x98\x82'

  1. How can I get rid of the b''? (I tried .decode('utf-8'), but it didn't work.)

  2. How can I convert

    Can we look google\xf0\x9f\xa4\xa9\xf0\x9f\x98\x82\xf0\x9f\x98\x82
    

    to

    Can we look google😂😂?
    

Solution

  • Open the file with the right encoding, not binary mode:

    with open('chat.txt', encoding='utf8') as file:
        for line in file:
            print(line, end='')
    

    How well this works depends on your execution environment: print succeeds only if the terminal or IDE, and the font it uses, can display these code points. That is not a Python issue.
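
If the file has to stay in binary mode, each line can also be decoded by hand. The sketch below assumes the backup really is UTF-8. It also shows a likely reason .decode('utf-8') seemed not to work in the question: the bytes were still passed through str(), which returns the b'...' representation rather than the text. That is a guess, since the question does not show the attempt. The helper name read_chat_lines is made up for illustration.

```python
# One line copied from the question's output: "Ya" followed by four UTF-8 bytes.
raw = b'7/21/20, 7:37 AM - elon: Ya\xf0\x9f\x98\x82'

# str() on a bytes object returns its repr, including the b'' wrapper
# and the \x escapes. This is what the question's code printed.
as_repr = str(raw)

# decode() interprets the bytes as UTF-8 and returns real text, in which
# the four bytes become the single code point U+1F602.
as_text = raw.decode('utf-8')

print(as_repr)                  # ASCII only, safe on any terminal
print(len(raw), len(as_text))   # byte count vs. character count


def read_chat_lines(path):
    """Yield decoded, stripped lines from a file opened in binary mode.

    Hypothetical helper; assumes the file is UTF-8 encoded.
    """
    with open(path, 'rb') as file:
        for line in file:
            yield line.strip().decode('utf-8')
```

Calling print(as_text) then shows the emoji itself, provided the terminal and font support it. Opening the file with encoding='utf8' as in the answer is still the simpler route, because Python does the decoding for every line.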