I have used tweepy to store the text of tweets in a csv file using Python csv.writer(), but I had to encode the text in utf-8 before storing, otherwise tweepy throws a weird error.
import pandas as pd
data = pd.read_csv(r'C:\Users\Lenovo\Desktop\_Carabinieri_10_tweets.csv', delimiter=",", encoding="utf-8")  # raw string, so that \U is not parsed as a unicode escape
data.head()
print(data.head())
Now, the text data is stored like this:
OUTPUT
id … text
0 1228280254256623616 … b'RT @MinisteroDifesa: #14febbraio Il Ministro…
1 1228257366841405441 … b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti…
2 1228235394954620928 … b'Eseguite dai #Carabinieri del Nucleo Investi…
3 1228219588589965316 … b'Il pianeta brucia\nConosci il black carbon?...
4 1228020579485261824 … b'RT @Coninews: Emozioni tricolore \xe2\x9c\xa…
Although I used "utf-8" to read the file into a DataFrame with the code shown above, the characters look very different in the output: the output looks like bytes. The language is Italian.
I tried to decode this using this code (there is more data in other columns; the text is in the second column), but it doesn't decode the text. I cannot use .decode('utf-8') because the csv reader reads the data as strings, i.e. type(row[2]) is 'str', and I can't seem to convert it into bytes: the data gets encoded once more!
How can I decode the text data?
I would be very happy if you can help with this, thank you in advance.
The problem likely comes from the way you have written your csv file. I would bet a coin that when read as text (with a simple text editor like notepad, notepad++, or vi) it actually contains:
1228280254256623616,…,b'RT @MinisteroDifesa: #14febbraio Il Ministro...'
1228257366841405441,…,b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'
...
or:
1228280254256623616,…,"b'RT @MinisteroDifesa: #14febbraio Il Ministro...'"
1228257366841405441,…,"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'"
...
Pandas read_csv then correctly reads the text representation of a byte string.
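To illustrate how such a file can come about, here is a minimal sketch. It assumes the text was encoded to bytes and then handed to csv.writer, as the question describes; the sample row is hypothetical. csv.writer stringifies non-string values with str(), and str() of a bytes object is its b'...' representation:

```python
import csv
import io

# Hypothetical sample row: the text is encoded to bytes before writing,
# which is what the question describes doing.
text = "“Non t’ama chi amor ti"
buffer = io.StringIO()
csv.writer(buffer).writerow([1228257366841405441, text.encode("utf-8")])

# The bytes object was converted with str(), so the file content holds
# the literal characters b'\xe2\x80\x9c...' rather than the Italian text.
written = buffer.getvalue()
print(written)
```

The escape sequences in the file are plain ASCII characters, so reading it back with encoding="utf-8" cannot restore the original text.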
The correct fix would be to write true UTF-8 encoded strings, but as I do not know the code, I cannot propose a fix.
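Without seeing the original code, only a generic sketch is possible. It assumes Python 3 and that the tweet text is available as an ordinary str; the rows and the file name below are hypothetical. The idea is to let the file object do the encoding and pass plain strings to csv.writer:

```python
import csv
import os
import tempfile

# Hypothetical (id, text) pairs standing in for data collected with tweepy.
rows = [
    (1228257366841405441, "“Non t’ama chi amor ti"),
]

# Hypothetical output location, used only for this example.
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# Open the file in text mode with an explicit encoding and do NOT call
# .encode() on the text: the file object encodes it to UTF-8 on write.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    writer.writerows(rows)

# Reading the file back with the same encoding returns the original text.
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))
```

If an error appeared when writing without .encode(), a likely cause is that the file was opened without encoding="utf-8" on a platform whose default encoding cannot represent some characters, but that is a guess.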
A possible workaround is to use ast.literal_eval to convert the text representation into a byte string and decode it:
import ast
df['text'] = df['text'].apply(lambda x: ast.literal_eval(x).decode('utf8'))
It should give:
id ... text
0 1228280254256623616 ... RT @MinisteroDifesa: #14febbraio Il Ministro...
1 1228257366841405441 ... “Non t’ama chi amor ti...
...
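If some cells are empty or are not byte-string representations, the one-line lambda will raise. A slightly more defensive variant could look like the sketch below; decode_bytes_repr is a hypothetical helper name, not part of pandas or the standard library:

```python
import ast


def decode_bytes_repr(value, encoding="utf-8"):
    """Turn the text "b'...'" back into a decoded str; return anything else unchanged."""
    # Leave missing values (None, NaN) and ordinary strings alone.
    if not isinstance(value, str) or not value.startswith(("b'", 'b"')):
        return value
    try:
        literal = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # truncated or malformed literal: keep the raw text
    return literal.decode(encoding) if isinstance(literal, bytes) else value


# Hypothetical cell content, as it would appear in the CSV file.
sample = r"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti'"
decoded = decode_bytes_repr(sample)
print(decoded)
```

It can be applied the same way as the lambda, e.g. df['text'] = df['text'].apply(decode_bytes_repr).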