import pandas as pd
from random import random
from collections import namedtuple
Smoker = namedtuple("Smoker", ["Female","Male"])
Nonsmoker = namedtuple("Nonsmoker", ["Female","Male"])
DF = dict()
DF["A"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
DF["B"] = [(Smoker(random(),random()), Nonsmoker(random(),random())) for t in range(3)]
DF = pd.DataFrame(DF, index=["t="+str(t+1) for t in range(3)])
I have this DataFrame, in which each cell is a tuple of two namedtuples. After I saved it to a CSV file and reloaded it, the printed output looked the same, but each cell had become a string. Why does this happen, and what should I do to get the same DataFrame back every time?
DF.to_csv("results.csv", index_label=False)
df = pd.read_csv('results.csv', index_col=0)
print(df)
for a,b in zip(df.A,df.B):
    print(type(a), type(b))
I believe that is expected behaviour. Since CSV is text-based, when you save an object-dtype column to CSV, the natural way is to write each value's string representation. So the tuple (1, 2) becomes the text "(1, 2)".
Now, when you read the CSV file back, the natural and safe way to interpret "(1, 2)" is of course as the string '(1, 2)', because the pandas CSV reader does not parse tuple-valued columns.
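To make the round trip concrete, here is a minimal sketch using plain tuples rather than the namedtuples from the question. The names original, reloaded and parsed are made up for illustration. It also shows that ast.literal_eval from the standard library can turn such strings back into tuples, if you are willing to parse them yourself.

```python
import ast
import io

import pandas as pd

# A small frame whose cells are plain tuples (object dtype).
original = pd.DataFrame({"A": [(1, 2), (3, 4)]}, index=["t=1", "t=2"])

# Write to an in-memory CSV and read it back, instead of using a file on disk.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

# After the round trip each cell is the text of the tuple, not a tuple.
cell_type_after_reload = type(reloaded.loc["t=1", "A"])

# Plain literals can be parsed back by hand, one cell at a time.
parsed = reloaded.copy()
parsed["A"] = parsed["A"].map(ast.literal_eval)
```

This manual parsing only works for plain Python literals. The namedtuples in the question print as constructor calls such as Smoker(Female=..., Male=...), which ast.literal_eval rejects, so it is not a fix for the original DataFrame.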
TL;DR: that's normal and expected behaviour. If you want to save and load data with object dtype, you should use a binary format such as pickle, via the DataFrame.to_pickle method and the pd.read_pickle function (there is no from_pickle).
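A minimal sketch of the pickle round trip, assuming a temporary directory and a made-up file name results.pkl. Plain tuples stand in for the namedtuples here to keep the example self-contained.

```python
import os
import tempfile

import pandas as pd

# A small frame whose cells are plain tuples (object dtype).
original = pd.DataFrame(
    {"A": [(0.1, 0.2), (0.3, 0.4)], "B": [(0.5, 0.6), (0.7, 0.8)]},
    index=["t=1", "t=2"],
)

# Save to a pickle file in a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.pkl")  # hypothetical file name
    original.to_pickle(path)
    restored = pd.read_pickle(path)

# The cell comes back as a tuple, not as a string.
restored_cell = restored.loc["t=1", "A"]
```

Namedtuples should survive the same way, as long as the namedtuple classes (Smoker and Nonsmoker in the question) are defined at module level in the code that loads the file, since pickle looks classes up by name. Also note that pickle files should only be loaded from sources you trust.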