python regex sentiment-analysis preprocessor data-cleaning

Cleaning \u2764\ufe0f \u2026 data in a file with Python


I am trying to clean Twitter data in Python with regex, but I can't remove \u2764\ufe0f and \u2026. The Twitter data is in the file datas.txt; this is the data:

Berkat biznet aku bisa online terimakasih BiznetHome \u2764\ufe0f Gangguan hari sabtu perbaikan nanti senin hari offline Slow respon \u2764\ufe0f Terima kasih TelkomCare masalah indihome sy sudah terselesaikan terima kasih fast responnya terus selalu tingka\u2026 TelkomCare Sudah beres fix internet dan telpon berfungsi normal thanks atas respons dan perbaikan pihak Indihom\u2026

I have tried three ways:
First

import re

with open ('datas.txt', 'r') as f:
     mylist = [line for line in f]
emoji_pattern = re.compile(r'\\\\u\w+')
for i in mylist:
    print(emoji_pattern.sub(r'', i))


Second

import re
f = open('datas.txt', 'r')
data = f.read()
emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
    "]+", flags=re.UNICODE)
emoji_pattern.sub(r'', data)


Third

f= open("datas.txt", "r", encoding="UTF-8")
datas = f.read()
data = datas.encode('ascii', 'ignore').decode("utf-8")
print(data)

But none of them work.


Solution

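  • Your text file does not contain the emoji characters themselves. It contains non-ASCII Unicode code points written out as literal escape sequences (a backslash, then u and four hex digits, or U and eight hex digits), the same way Python writes Unicode literals in source code. There are two things you can do with that: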

    • Delete all \uXXXX or \UXXXXXXXX sequences from your data. This will remove all Unicode codepoints written in Python literal format, which, in principle (although not necessarily), will be non-ASCII characters. That can be done for example like this:
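    import re
    
    # Text mode is fine: the escapes are plain ASCII characters in the file
    with open('datas.txt', 'r') as f:
        mylist = [line for line in f]
    # A literal backslash, then u + 4 hex digits or U + 8 hex digits
    unicode_literal = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')
    for i in mylist:
        print(unicode_literal.sub(r'', i))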
    
    • Interpret Unicode code points as their intended value. That is, you will get a string with the non-ASCII data corresponding to the codepoints expressed in the text file. You can do that like this:
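    # Note: the file is read in binary mode, so each line is a bytes object
    with open('datas.txt', 'rb') as f:
        mylist = [line for line in f]
    for i in mylist:
        # Decode each line (not the whole list) to turn escapes into characters
        print(i.decode('unicode-escape'))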
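To try both options without touching datas.txt, here is a minimal sketch that runs them on an in-memory string. The sample text is a shortened excerpt of the data in the question. The helper names strip_escapes and decode_escapes are made up for illustration, and collapsing the leftover double spaces is my own addition, not something the question asked for.

```python
import re

# Matches Python-style escapes that are written out literally in the text:
# a backslash, then u + 4 hex digits or U + 8 hex digits.
UNICODE_LITERAL = re.compile(r'\\u[0-9a-fA-F]{4}|\\U[0-9a-fA-F]{8}')


def strip_escapes(text):
    """Option 1: delete the literal escape sequences, then tidy whitespace."""
    without_escapes = UNICODE_LITERAL.sub('', text)
    # Removing an escape between two spaces leaves a double space; collapse it.
    return ' '.join(without_escapes.split())


def decode_escapes(raw):
    """Option 2: turn the literal escape sequences into real characters."""
    return raw.decode('unicode-escape')


# Hypothetical sample. It is a raw string, so the backslashes are literal
# characters, just as they are inside datas.txt.
sample = r'terimakasih BiznetHome \u2764\ufe0f Gangguan tingka\u2026 TelkomCare'

print(strip_escapes(sample))
# ascii() keeps the output printable on consoles that cannot show emoji.
print(ascii(decode_escapes(sample.encode('ascii'))))
```

If the data came from JSON, code points above U+FFFF may be written as surrogate pairs (two \uXXXX escapes in a row). Option 2 would then give you lone surrogates rather than one emoji, so in that case option 1 is probably the safer choice.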