python utf-8 tweepy

How to read csv files (with special characters) in Python? How can I decode the text data? Read encoded text from file and convert to string


I used tweepy to collect tweets and stored their text in a CSV file with Python's csv.writer(), but I had to encode the text as UTF-8 before storing it; otherwise tweepy throws a weird error.

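import pandas as pd

# Raw string: in a normal string literal the '\U' in this Windows path
# starts a unicode escape and raises a SyntaxError in Python 3.
data = pd.read_csv(r'C:\Users\Lenovo\Desktop\_Carabinieri_10_tweets.csv', delimiter=",", encoding="utf-8")

print(data.head())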

Now, the text data is stored like this:

OUTPUT

                    id … text
0  1228280254256623616 … b'RT @MinisteroDifesa: #14febbraio Il Ministro…
1  1228257366841405441 … b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti…
2  1228235394954620928 … b'Eseguite dai #Carabinieri del Nucleo Investi…
3  1228219588589965316 … b'Il pianeta brucia\nConosci il black carbon?...
4  1228020579485261824 … b'RT @Coninews: Emozioni tricolore \xe2\x9c\xa…

Although I passed encoding="utf-8" when reading the file into a DataFrame with the code shown above, the characters look very different in the output: the text looks like bytes. The language is Italian.

I tried to decode the text (there is more data in other columns; the text is in the second column), but it does not get decoded. I cannot use .decode('utf-8') because the csv reader returns the data as strings, i.e. type(row[2]) is 'str', and I can't seem to convert it into bytes: the data just gets encoded once more!

How can I decode the text data?

I would be very happy if you can help with this, thank you in advance.


Solution

  • The problem is likely to come from the way you have written your csv file. I would bet a coin that when read as text (with a simple text editor like notepad, notepad++, or vi) it actually contains:

    1228280254256623616,…,b'RT @MinisteroDifesa: #14febbraio Il Ministro...'
    1228257366841405441,…,b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'
    ...
    

    or:

    1228280254256623616,…,"b'RT @MinisteroDifesa: #14febbraio Il Ministro...'"
    1228257366841405441,…,"b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti...'"
    ...
    

    Pandas read_csv then correctly reads what is in the file: the text representation of a byte string (the literal characters b'...'), not the bytes themselves.

    The correct fix would be to write true UTF-8 encoded strings, but as I do not know the code, I cannot propose a fix.

    A possible workaround is to use ast.literal_eval to convert the text representation into a byte string and decode it:

    import ast  # standard library; provides literal_eval
    data['text'] = data['text'].apply(lambda x: ast.literal_eval(x).decode('utf-8'))
    

    It should give:

                        id ... text
    0  1228280254256623616 ... RT @MinisteroDifesa: #14febbraio Il Ministro...
    1  1228257366841405441 ... “Non t’ama chi amor ti...
    ...
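
As for the write-side fix, here is a rough sketch of what it might look like. The code that wrote the file is not shown in the question, so this is an assumption about how it worked rather than a reconstruction of it. The helper names write_tweets and read_texts and the encode_first flag are made up for illustration, and the sample rows are copied from the output above. The idea is to open the file in text mode with encoding="utf-8" and pass plain str values to csv.writer. If the text is encoded first, csv.writer calls str() on the bytes object and stores its b'...' representation.

```python
import ast
import csv
import os
import tempfile

# Sample rows copied from the question's output. In the real script the text
# would come from tweepy; that code is not shown, so this is an assumption.
rows = [
    (1228280254256623616, "RT @MinisteroDifesa: #14febbraio Il Ministro"),
    (1228257366841405441, "\u201cNon t\u2019ama chi amor ti"),
]


def write_tweets(path, rows, encode_first):
    """Write (id, text) rows to a CSV file opened in UTF-8 text mode.

    encode_first=True reproduces the suspected bug: the text is encoded to
    bytes, and csv.writer stringifies the bytes object as b'...'.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "text"])
        for tweet_id, text in rows:
            value = text.encode("utf-8") if encode_first else text
            writer.writerow([tweet_id, value])


def read_texts(path):
    """Return the text column of a CSV file written by write_tweets."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        return [row[1] for row in reader]


with tempfile.TemporaryDirectory() as tmp:
    bad_path = os.path.join(tmp, "bad.csv")
    good_path = os.path.join(tmp, "good.csv")
    write_tweets(bad_path, rows, encode_first=True)
    write_tweets(good_path, rows, encode_first=False)
    bad_texts = read_texts(bad_path)
    good_texts = read_texts(good_path)

# The workaround from above, applied to the badly written file.
recovered = [ast.literal_eval(t).decode("utf-8") for t in bad_texts]

expected = [text for _, text in rows]
print(bad_texts[1])           # b'\xe2\x80\x9cNon t\xe2\x80\x99ama chi amor ti'
print(good_texts == expected)  # True
print(recovered == expected)   # True
```

With a file written this way, pd.read_csv(path, encoding="utf-8") returns the text column as ordinary strings and no ast.literal_eval step is needed.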