How to read csv files (with special characters) in Python? How can I decode the text data? Read encoded text from file and convert to string

I used tweepy to store the text of tweets in a csv file with Python's csv.writer(), but I had to encode the text as UTF-8 before storing it; otherwise tweepy throws a weird error.

import pandas as pd

# raw string, so that the backslashes in the Windows path are not treated as escapes
data = pd.read_csv(r'C:\Users\Lenovo\Desktop\_Carabinieri_10_tweets.csv', delimiter=",", encoding="utf-8")

data.head()

print(data.head())

Now, the text data is stored like this:

OUTPUT

id … text

0 1228280254256623616 … b'RT @MinisteroDifesa: #14febbraio Il Ministro…

1 1228257366841405441 … b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti…

2 1228235394954620928 … b'Eseguite dai #Carabinieri del Nucleo Investi…

3 1228219588589965316 … b'Il pianeta brucia\nConosci il black carbon?...

4 1228020579485261824 … b'RT @Coninews: Emozioni tricolore \xe2\x9c\xa…

Although I passed encoding="utf-8" when reading the file into a DataFrame with the code shown above, the characters look very different in the output: the text looks like bytes. The language is Italian.

I tried to decode this (there is more data in other columns; the text is in the second column), but the text does not get decoded. I cannot use .decode('utf-8') because the csv reader reads the data as strings, i.e. type(row[2]) is 'str', and I can't seem to convert it back into bytes: when I try, the data gets encoded once more!

How can I decode the text data?

I would be very happy if you can help with this, thank you in advance.

Solution

The problem most likely comes from the way you wrote your csv file. I would bet a coin that, when read as text (with a simple text editor like notepad, notepad++, or vi), it actually contains:

1228280254256623616,…,b'RT @MinisteroDifesa: #14febbraio Il Ministro...'
1228257366841405441,…,b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'
...

or:

1228280254256623616,…,"b'RT @MinisteroDifesa: #14febbraio Il Ministro...'"
1228257366841405441,…,"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'"
...

Pandas read_csv then correctly reads the text representation of a byte string.
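
To illustrate how such a file can come about, here is a minimal sketch. It assumes the text was passed to csv.writer after calling .encode('utf-8'), which is my guess about the writing code, not something I have seen. The sample row and the in-memory buffer are made up for the demonstration.

```python
import csv
import io

# Hypothetical sample text; the real tweets would come from tweepy.
text = "“Non t’ama chi amor ti"

buf = io.StringIO()  # stands in for the csv file
writer = csv.writer(buf)

# csv.writer calls str() on anything that is not already a string,
# so a bytes object is written as its text representation: b'...'
writer.writerow([1228257366841405441, text.encode("utf-8")])

written = buf.getvalue().strip()
print(written)
# prints: 1228257366841405441,b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti'
```

The backslashes in that line are literal characters in the file, which is why reading it back as UTF-8 changes nothing.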

The correct fix would be to write true UTF-8 encoded strings, but as I do not know the code, I cannot propose a fix.
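
Since I do not know the actual writing code, the following is only a sketch of what writing true UTF-8 text could look like. The rows, the column names and the file name are hypothetical. The idea is to pass str objects to csv.writer and let open() do the encoding. If the original error was an encoding error raised while writing (which is an assumption on my side), an explicit encoding="utf-8" on open() may be enough to avoid it.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the tweets: (id, text), with text as str.
rows = [
    (1228257366841405441, "“Non t’ama chi amor ti"),
    (1228219588589965316, "Il pianeta brucia\nConosci il black carbon?"),
]

# Hypothetical file name in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "tweets.csv")

# open() encodes the text as UTF-8; the writer only ever sees str objects.
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    writer.writerows(rows)

# Reading back with the same encoding returns the original text.
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back[1][1])
```

A file written this way can be loaded with pd.read_csv(path, encoding="utf-8") without any further decoding step.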

A possible workaround is to use ast.literal_eval to convert the text representation into a byte string and decode it:

import ast
data['text'] = data['text'].apply(lambda x: ast.literal_eval(x).decode('utf8'))

It should give:

                    id ... text
0  1228280254256623616 ... RT @MinisteroDifesa: #14febbraio Il Ministro...
1  1228257366841405441 ... “Non t’ama chi amor ti...
...
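
The one-liner above raises an exception as soon as a cell is empty or is not a valid bytes literal. As a hedged sketch, a slightly more defensive variant could look like this; decode_bytes_repr is a name I made up, and the sample values are illustrative only.

```python
import ast


def decode_bytes_repr(value):
    """Turn the text "b'...'" back into a str; return anything else unchanged."""
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # looked like a bytes literal but could not be parsed
        if isinstance(parsed, bytes):
            return parsed.decode("utf-8")
    return value


samples = [
    r"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti'",  # text as stored in the file
    "already plain text",
    float("nan"),  # pandas represents missing cells as NaN
]
decoded = [decode_bytes_repr(s) for s in samples]
print(decoded[0])
```

The helper can be passed to .apply() on the text column in place of the lambda. It still raises UnicodeDecodeError if the bytes are not valid UTF-8, which seems preferable to silently keeping broken text.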