I am scraping data from a WhatsApp chat backup (chat.txt). It looks like this:
7/21/20, 1:31 PM - mark: Can we look google😂😂
7/21/20, 1:31 PM - elon: No
7/21/20, 1:31 PM - mark: Can we smile ?
7/21/20, 1:31 PM - elon: Ya🤩
When I read it line by line:
with open('chat.txt', 'rb') as file:
    for line in file:
        print(str(line.strip()))
I got this:
b'7/21/20, 7:37 AM - mark: Can we look google\xf0\x9f\xa4\xa9\xf0\x9f\x98\x82\xf0\x9f\x98\x82'
b'7/21/20, 7:37 AM - elon: No'
b'7/21/20, 1:31 PM - mark: Can we smile ?'
b'7/21/20, 7:37 AM - elon: Ya\xf0\x9f\x98\x82'
How can we get rid of b''? (I tried .decode('utf-8'), but it didn't work.)
How can I convert
Can we look google\xf0\x9f\xa4\xa9\xf0\x9f\x98\x82\xf0\x9f\x98\x82
to
Can we look google😂😂?
Open the file with the right encoding, not binary mode:
with open('chat.txt', encoding='utf8') as file:
    for line in file:
        print(line, end='')
How well this works depends on your execution environment. For print to succeed, your terminal or IDE needs an encoding and a font that can display these code points, but that is not a Python issue.
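As for where the b'' comes from: calling str() on a bytes object returns its repr instead of decoding it, which is probably why the output still looked like bytes. Here is a minimal sketch, assuming the data really is UTF-8; it uses a hypothetical in-memory sample line (raw) instead of reading chat.txt:

```python
# Hypothetical sample line, standing in for one line read in 'rb' mode.
raw = b'7/21/20, 1:31 PM - elon: Ya\xf0\x9f\xa4\xa9'

# str() on bytes gives the repr, including the b'' wrapper and \x escapes.
as_repr = str(raw)

# decode() interprets the bytes as UTF-8 and returns real text.
text = raw.decode('utf-8')

print(as_repr)
print(text)
```

If you do need binary mode for some reason, decoding each line this way should give the same text as opening the file with encoding='utf8'.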