python pandas dataframe csv namedtuple

Why do tuples become strings after saving to csv and reloading the dataframe (pandas)?


import pandas as pd
from random import random
from collections import namedtuple

Smoker    = namedtuple("Smoker", ["Female","Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female","Male"])

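# Each cell holds a tuple of two namedtuples: (Smoker, Nonsmoker).
DF = dict()
DF["A"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
DF["B"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
# Build the dataframe with row labels "t=1", "t=2", "t=3".
DF = pd.DataFrame(DF, index=["t="+str(t+1) for t in range(3)])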

I have this dataframe, in which each cell is a tuple of two namedtuples. After I saved it to a csv file and reloaded it, the printed output looked the same, but each cell had become a string. Why does this happen? What should I do to get the same dataframe back every time?

DF.to_csv("results.csv", index_label=False)
df = pd.read_csv('results.csv', index_col=0)

print(df)

for a,b in zip(df.A,df.B):
    print(type(a),type(b))

Solution

  • This is expected behaviour. CSV is a text-based format, so when you save an object-dtype column to CSV, pandas writes the string representation of each value. The tuple (1, 2) is written as "(1, 2)", and a namedtuple inside it is written as its repr, for example "Smoker(Female=0.1, Male=0.2)".

    When you read the CSV file back, the natural and safe way to interpret "(1, 2)" is as the string '(1, 2)', because pandas has no engine for parsing tuple-valued columns. That is why the printed dataframe looks the same even though every cell is now a str.

    TL;DR: that's normal and expected behaviour. If you want to save and load object-dtype data unchanged, use a binary format instead, such as the DataFrame.to_pickle method together with the pandas.read_pickle function.
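
The difference between the two formats can be checked with a small round trip. The following is only a sketch under a few assumptions: it uses fixed numbers instead of random() so the result is reproducible, writes to a temporary directory, and the file names results.csv and results.pkl are placeholders. It relies on DataFrame.to_pickle for saving and pandas.read_pickle for loading.

```python
import os
import tempfile
from collections import namedtuple

import pandas as pd

# Same namedtuple classes as in the question. They are defined at module
# level so that pickle can look them up by name when loading.
Smoker = namedtuple("Smoker", ["Female", "Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female", "Male"])

# Fixed values (an assumption for this sketch) instead of random().
cells = [(Smoker(0.1, 0.2), Nonsmoker(0.3, 0.4)) for _ in range(3)]
original = pd.DataFrame({"A": cells, "B": cells}, index=["t=1", "t=2", "t=3"])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "results.csv")  # placeholder file name
    pkl_path = os.path.join(tmp, "results.pkl")  # placeholder file name

    # CSV round trip: every cell is written as text and read back as text.
    original.to_csv(csv_path)
    from_csv = pd.read_csv(csv_path, index_col=0)

    # Pickle round trip: the Python objects themselves are serialized.
    original.to_pickle(pkl_path)
    from_pickle = pd.read_pickle(pkl_path)

csv_cell = from_csv.loc["t=1", "A"]
pickle_cell = from_pickle.loc["t=1", "A"]

print(type(csv_cell).__name__, type(pickle_cell).__name__)
```

With this setup the CSV cell should come back as a str and the pickled cell as a tuple whose first element is still a Smoker. Pickle files can execute code when loaded, so pandas.read_pickle is only appropriate for files you created yourself or otherwise trust.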