Tags: python, pandas, dataframe, parquet, pyarrow

How to save a pandas dataframe when a column contains sets


When trying to save a pandas dataframe where a column contains sets (see example below)

import pandas as pd

df = pd.DataFrame({"col_set": [{"A", "B", "C"}, {"D", "E", "F"}]})
df.to_parquet("df_w_col_set.parquet")

The following error is thrown:

ArrowInvalid: ("Could not convert {'C', 'B', 'A'} with type set: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column col_set with type object')

How can one save this kind of dataframe and avoid the error above?

Some semi-related posts mention providing a pyarrow schema, but I'm not clear on what type to use when consulting the pyarrow data types.

The code was run with Python 3.7.4, pandas==1.3.0 and pyarrow==3.0.0.

I'm mainly looking for a solution where upgrades are not needed, or at least kept to a minimum (to avoid breaking other dependencies).


Solution

  • As a workaround, you can convert each set to a string before saving, then use ast.literal_eval after loading to parse each string back into a set:

    import ast
    
    df.astype({'col_set': str}).to_parquet('data.parquet')
    df1 = pd.read_parquet('data.parquet') \
            .assign(col_set=lambda x: x['col_set'].map(ast.literal_eval))
    print(df1)
    
    # Output
         col_set
    0  {C, B, A}
    1  {F, E, D}
    

    Or you can convert each set to a tuple (or list) before saving, then convert back to a set after loading:

    df.assign(col_set=df['col_set'].map(tuple)).to_parquet('test.parquet')
    df1 = pd.read_parquet('test.parquet') \
            .assign(col_set=lambda x: x['col_set'].map(set))
    print(df1)
    
    # Output
         col_set
    0  {C, B, A}
    1  {F, E, D}
    

    You can also use pickle.dumps to serialize each set to bytes before saving and pickle.loads to restore it after loading:

    import pickle
    
    df.assign(col_set=df['col_set'].map(pickle.dumps)).to_parquet('test.parquet')
    df1 = pd.read_parquet('test.parquet') \
            .assign(col_set=lambda x: x['col_set'].map(pickle.loads))
    print(df1)
    
    # Output
         col_set
    0  {C, B, A}
    1  {F, E, D}
    

    In fact, any serialization/deserialization method works, except JSON applied directly to the sets, because JSON has no set type.
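
JSON can still be used with one extra step: convert each set to a list before encoding and rebuild the set after decoding. Below is a minimal sketch of that idea. The helper names set_to_json and json_to_set are hypothetical (they are not part of pandas or pyarrow), and sorting assumes the elements of each set are mutually comparable, for example all strings.

```python
import json


def set_to_json(values):
    """Encode a set as a JSON array string (hypothetical helper).

    Sorting gives a deterministic string; it assumes the elements
    can be compared with each other (e.g. all strings).
    """
    return json.dumps(sorted(values))


def json_to_set(text):
    """Decode a JSON array string back into a set (hypothetical helper)."""
    return set(json.loads(text))


original = {"A", "B", "C"}

# Passing the set straight to json.dumps fails, since JSON has no set type.
try:
    json.dumps(original)
except TypeError as exc:
    print("direct JSON encoding failed:", exc)

encoded = set_to_json(original)
restored = json_to_set(encoded)
print(encoded)               # ["A", "B", "C"]
print(restored == original)  # True
```

With a dataframe, the same helpers could be applied in the style of the snippets above: df['col_set'].map(set_to_json) before to_parquet, and .map(json_to_set) on the column after read_parquet.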