When trying to save a pandas DataFrame where a column contains sets (see the example below)
import pandas as pd
df = pd.DataFrame({"col_set": [{"A", "B", "C"}, {"D", "E", "F"}]})
df.to_parquet("df_w_col_set.parquet")
The following error is thrown:
ArrowInvalid: ("Could not convert {'C', 'B', 'A'} with type set: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column col_set with type object')
How can one save this kind of dataframe and avoid the error above?
Some semi-related posts mention providing a pyarrow schema, but I'm not clear on which type to use when consulting the pyarrow data types.
The code was run with Python 3.7.4, pandas==1.3.0 and pyarrow==3.0.0.
I'm mainly looking for a solution where upgrades are not needed, or are kept to a minimum (to avoid breaking other dependencies).
As a workaround, you can convert your set to a string and use ast.literal_eval to evaluate the string back into a set:
import ast
df.astype({'col_set': str}).to_parquet('data.parquet')
df1 = pd.read_parquet('data.parquet') \
.assign(col_set=lambda x: x['col_set'].map(ast.literal_eval))
print(df1)
# Output
col_set
0 {C, B, A}
1 {F, E, D}
Or you can convert your set to a tuple (or a list) and then revert it back to a set:
df.assign(col_set=df['col_set'].map(tuple)).to_parquet('test.parquet')
df1 = pd.read_parquet('test.parquet') \
.assign(col_set=lambda x: x['col_set'].map(set))
print(df1)
# Output
col_set
0 {C, B, A}
1 {F, E, D}
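The list variant mentioned above works the same way; here is a minimal sketch (the file name is just for illustration):
# convert each set to a list before writing, then restore sets after reading
df.assign(col_set=df['col_set'].map(list)).to_parquet('test_list.parquet')
df1 = pd.read_parquet('test_list.parquet') \
.assign(col_set=lambda x: x['col_set'].map(set))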
You can also use pickle.dumps and pickle.loads to serialize and deserialize your set:
import pickle
df.assign(col_set=df['col_set'].map(pickle.dumps)).to_parquet('test.parquet')
df1 = pd.read_parquet('test.parquet') \
.assign(col_set=lambda x: x['col_set'].map(pickle.loads))
print(df1)
# Output
col_set
0 {C, B, A}
1 {F, E, D}
In fact, you can choose any serialization/deserialization method (except plain JSON, because the set type does not exist in JSON).
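To illustrate why plain JSON does not work out of the box, here is a hedged sketch (the file name is just for illustration): json.dumps raises a TypeError on a set, so you would have to go through a list first, which is essentially the tuple/list approach above:
import json
# json.dumps({'A', 'B', 'C'}) raises TypeError because JSON has no set type,
# so convert each set to a list before dumping and back to a set after loading
df.assign(col_set=df['col_set'].map(lambda s: json.dumps(list(s)))).to_parquet('test_json.parquet')
df1 = pd.read_parquet('test_json.parquet') \
.assign(col_set=lambda x: x['col_set'].map(lambda s: set(json.loads(s))))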