I am trying to clean Twitter data in Python with regex, but I can't remove \u2764\ufe0f and \u2026. The Twitter data is in the file datas.txt; this is the data:
Berkat biznet aku bisa online terimakasih BiznetHome \u2764\ufe0f Gangguan hari sabtu perbaikan nanti senin hari offline Slow respon \u2764\ufe0f Terima kasih TelkomCare masalah indihome sy sudah terselesaikan terima kasih fast responnya terus selalu tingka\u2026 TelkomCare Sudah beres fix internet dan telpon berfungsi normal thanks atas respons dan perbaikan pihak Indihom\u2026
I have tried three approaches:
First
import re
with open('datas.txt', 'r') as f:
    mylist = [line for line in f]
emoji_pattern = re.compile(r'\\\\u\w+')
for i in mylist:
    print(emoji_pattern.sub(r'', i))
Second
import re
f = open('datas.txt', 'r')
data = f.read()
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001f926-\U0001f937"
u'\U00010000-\U0010ffff'
u"\u200d"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u23cf"
u"\u23e9"
u"\u231a"
u"\u3030"
u"\ufe0f"
"]+", flags=re.UNICODE)
emoji_pattern.sub(r'', data)
Third
f= open("datas.txt", "r", encoding="UTF-8")
datas = f.read()
data = datas.encode('ascii', 'ignore').decode("utf-8")
print(data)
But none of them work.
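Your text file does not contain the non-ASCII characters themselves. It contains them written out the way Python writes Unicode literals in source code: a literal backslash, a u or U, and hex digits, all of which are plain ASCII. That is why your second and third attempts find nothing to remove, and why your first pattern, r'\\\\u\w+', fails: it looks for two consecutive backslashes where the data shows only one. There are two things you can do with that.

The first is to remove all \uXXXX or \UXXXXXXXX sequences from your data. This removes every code point written in Python literal format, which will usually (although not necessarily) be a non-ASCII character. That can be done, for example, like this:

import re
with open('datas.txt', 'r') as f:
    mylist = [line for line in f]
unicode_literal = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')
for i in mylist:
    print(unicode_literal.sub(r'', i))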
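To check the single-backslash pattern without touching datas.txt, here is a small self-contained sketch that runs the same substitution on an in-memory string. The sample is a shortened excerpt of the data from the question; the helper name strip_escapes and the whitespace cleanup are illustrative additions, not part of the code above.

```python
import re

# A literal backslash, then u + 4 hex digits or U + 8 hex digits.
UNICODE_LITERAL = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')


def strip_escapes(text):
    """Remove literal backslash-u escape sequences (hypothetical helper)."""
    cleaned = UNICODE_LITERAL.sub('', text)
    # Collapse the extra spaces left behind where a sequence was removed.
    return re.sub(r'\s+', ' ', cleaned).strip()


# Raw string: the backslashes are literal characters, as they are in the file.
sample = r'terimakasih BiznetHome \u2764\ufe0f Gangguan hari sabtu tingka\u2026 TelkomCare'
print(strip_escapes(sample))
```

Note that this also deletes sequences that happen to encode ASCII characters (for example \u0041), which may or may not be what you want.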
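The second is to decode the sequences into the actual Unicode characters they represent, using the unicode-escape codec:

# Note: the file is read in binary mode
with open('datas.txt', 'rb') as f:
    mylist = [line for line in f]
for i in mylist:
    print(i.decode('unicode-escape'))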
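If the goal is still to end up with cleaned text, decoding can be combined with the ASCII filter from the third attempt in the question. The following is a minimal sketch on an in-memory bytes value, assuming the rest of the data is plain ASCII; the variable names and the shortened sample are made up for illustration.

```python
# The file holds literal backslash escapes, so doubled backslashes in this
# bytes literal reproduce them (shortened excerpt of the data).
raw = b'terimakasih BiznetHome \\u2764\\ufe0f tingka\\u2026 TelkomCare'

# Step 1: turn each escape sequence into the character it names.
decoded = raw.decode('unicode-escape')

# Step 2 (optional): drop everything outside ASCII, which now has an effect
# because the text contains real non-ASCII characters.
ascii_only = decoded.encode('ascii', 'ignore').decode('ascii')

print(ascii_only)
```

One caveat: unicode-escape interprets any bytes that are not part of an escape sequence as Latin-1, so this works cleanly only when everything else in the file is ASCII.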