python-3.x pandas dataframe twitter unicode

Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3


I have a CSV file containing tweets downloaded through the API. The tweets contain some Unicode characters, and I have a fairly good idea of how to decode them.

I loaded the CSV file into a DataFrame:

import pandas as pd

df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns

One of the tweets is:

b'RT : This little girl dressed as her father for Halloween, a  employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'

But when I access this tweet with df['tweet'][0], the output comes back in the following format:

"b'RT : This little girl dressed as her father for Halloween, a  employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"

I can't figure out why an extra backslash is being added to the tweet. As a result, the content is not being decoded. Below are a few rows from the DataFrame.

      time                         tweet
0   2018-11-02 05:55:46        b'RT : This little girl dressed as her father for Halloween, a  employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1   2018-11-02 05:46:41        b'RT : This little girl dressed as her father for Halloween, a  employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2   2018-11-02 03:44:35        b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3   2018-11-02 03:37:03        b' service is a joke. No service northbound  No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6' 

[Screenshot of 'sample.csv']

As I mentioned, whenever any of these tweets is accessed directly, an extra backslash appears in the output.

Can anyone please explain why this is happening and how to avoid it?

Thanks.


Solution

  • You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter. That is, inside the CSV file itself you will find the literal b'\xff...' characters.

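    So when you read it in Python, the value looks like a bytes object when printed (the ones represented as b'...'), but it is actually a string whose content is that representation. The doubled backslash is only how Python displays a single literal backslash when it shows the repr of a string; nothing extra was added to the data.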

    One way to get these back as proper strings is to let Python evaluate their content: they then become valid bytes objects, which can be decoded into text. It is a good idea to use ast.literal_eval for this, as eval will execute arbitrary code.

    So, after you have your data loaded into your dataframe, this could fix your tweets column:

    import ast
    
    df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
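
To see the whole round trip in one place, here is a minimal sketch. It assumes the CSV really does hold the text of a bytes repr, as guessed above. The in-memory csv_text and the helper name decode_bytes_repr are made up for illustration and are not part of your data or of pandas. The helper also accepts the b"..." form (which Python uses when the bytes contain an apostrophe) and leaves malformed or non-string cells untouched, which the one-liner above does not.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for sample.csv: the tweet column holds the *text*
# of a bytes repr, with literal backslash characters in the file.
csv_text = r"""2018-11-02 05:55:46,b'employee \xf0\x9f\x98\x82 (via )'
2018-11-02 05:46:41,plain text
"""

df = pd.read_csv(io.StringIO(csv_text), header=None, names=["time", "tweet"])

first = df["tweet"][0]
# The cell is a str with one literal backslash per escape sequence.
# repr() doubles each backslash for display; print() shows the stored text.
print(repr(first))
print(first)


def decode_bytes_repr(value):
    """Convert the text of a bytes literal into decoded text, if possible."""
    # Leave missing values and ordinary strings alone.
    if not (isinstance(value, str) and value.startswith(("b'", 'b"'))):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (SyntaxError, ValueError):
        # Malformed or truncated literal: keep the original text.
        return value
    return parsed.decode("utf-8") if isinstance(parsed, bytes) else value


df["tweet"] = df["tweet"].map(decode_bytes_repr)
print(df["tweet"][0])
```

If you control the script that writes the CSV, the cleaner fix is probably to decode the bytes before saving them, so that no repr text ends up in the file in the first place.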