Save a pandas dataframe with a column with 2d arrays as a parquet file in python


I'm trying to save a pandas dataframe to a parquet file using df.to_parquet(). df is a dataframe with multiple columns, and one of the columns holds a 2D array in each row. When I do this, I get an error from pyarrow complaining that only 1D arrays are supported. I googled and it seems there is no solution. I just want to confirm that there is in fact no solution, and that I have to somehow represent my 2D arrays as 1D arrays.


Solution

  • It's correct that pyarrow / parquet cannot store 2D arrays directly in a column.

    However, parquet (and arrow) do support nested lists, so you can represent a 2D array as a list of lists (on the Python side, an array of arrays or a list of arrays is also fine). One option is therefore to convert your 2D arrays to that format.

    An example showing that such nested lists/arrays work:

    In [1]: import numpy as np
       ...: import pandas as pd
    
    In [2]: df = pd.DataFrame(
       ...:      {'a': [[np.array([1, 2, 3]), np.array([4, 5, 6])],
       ...:             [np.array([3, 4, 5]), np.array([6, 7, 8])]]})
    
    In [3]: df.to_parquet('test_nested_list.parquet') 
    
    In [4]: res = pd.read_parquet('test_nested_list.parquet')
    
    In [5]: res['a']
    Out[5]: 
    0    [[1, 2, 3], [4, 5, 6]]
    1    [[3, 4, 5], [6, 7, 8]]
    Name: a, dtype: object
    
    In [6]: res['a'].values
    Out[6]: 
    array([array([array([1, 2, 3]), array([4, 5, 6])], dtype=object),
           array([array([3, 4, 5]), array([6, 7, 8])], dtype=object)],
          dtype=object)
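
If the column really does hold 2D arrays (rather than lists of 1D arrays, as in the session above), the conversion could look roughly like the sketch below. The helper names `to_nested_lists`, `to_2d_arrays` and `roundtrip_via_parquet` are made up for illustration and are not part of pandas or pyarrow. The sketch also assumes every array is rectangular, so its rows can be stacked back together after reading.

```python
import numpy as np
import pandas as pd


def to_nested_lists(column):
    """Turn each 2D array in the column into a plain list of lists."""
    return column.map(lambda arr: np.asarray(arr).tolist())


def to_2d_arrays(column):
    """Stack the rows of each cell back into one 2D array.

    Works for lists of lists as well as for the array-of-arrays cells
    that read_parquet returns. Assumes all rows have the same length.
    """
    return column.map(lambda rows: np.stack(list(rows)))


def roundtrip_via_parquet(df, path, column):
    """Hypothetical helper: write df with `column` as nested lists, read it back."""
    out = df.copy()
    out[column] = to_nested_lists(out[column])
    out.to_parquet(path)  # needs a parquet engine such as pyarrow
    res = pd.read_parquet(path)
    res[column] = to_2d_arrays(res[column])
    return res


# Build a column whose cells are 2D arrays (object array filled cell by cell).
cells = np.empty(2, dtype=object)
cells[0] = np.array([[1, 2, 3], [4, 5, 6]])
cells[1] = np.array([[3, 4, 5], [6, 7, 8]])
df = pd.DataFrame({"a": cells})

# In-memory check of the two conversions, without touching disk.
nested = to_nested_lists(df["a"])
restored = to_2d_arrays(nested)
print(restored.iloc[0].shape)  # (2, 3)
```

With a parquet engine installed, calling `roundtrip_via_parquet(df, 'test_2d.parquet', 'a')` should return a dataframe whose `a` cells are 2D arrays again. Ragged data (rows of different lengths) would still save as nested lists, but `np.stack` would fail when rebuilding the arrays.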