Tags: numpy, parquet, pyarrow

How can I convert an ndarray/multi-dimensional array to a parquet file?


I have a <class 'numpy.ndarray'> array that I would like to save to a parquet file to pass to an ML model I'm building. My array contains 159573 arrays, each of which contains 1395 values.

Here is a sample of my data:

[[0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 [0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 [0.         0.         0.         ... 0.24093714 0.75547471 0.74532781]
 ...
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]
 [0.         0.         0.         ... 0.89473684 0.29282009 0.29277004]]

I tried to convert using this code:

import pyarrow as pa
import pyarrow.parquet  # required for pa.parquet.write_table

pa_table = pa.table({"data": Main_x})
pa.parquet.write_table(pa_table, "full_data.parquet")

I get this stacktrace:

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.table()

/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.asarray()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()

/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: only handle 1-dimensional arrays

I'm wondering if there is a way to save a multi-dimensional array to parquet?


Solution

  • Parquet/Arrow isn't best suited to saving this type of data. It's better at dealing with tabular data that has a well-defined schema and specific column names and types. In particular, the numpy conversion API only supports one-dimensional data.

    That said, you can easily convert your 2-D numpy array to parquet, but you need to massage it first.

    Your best option is to save it as a table with n columns of m doubles each. (An alternative layout that stores each row as a list is sketched after the example below.)

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    matrix = np.random.rand(10, 100)
    arrays = [
        pa.array(col)  # each row of the matrix becomes one column of the table
        for col in matrix
    ]
    
    table = pa.Table.from_arrays(
        arrays,
        names=[str(i) for i in range(len(arrays))]  # give each column a name
    )
    # Save it:
    pq.write_table(table, 'table.pq')
    
    # Read it back as numpy (transpose to recover the original row/column layout):
    table_from_parquet = pq.read_table('table.pq')
    matrix_from_parquet = table_from_parquet.to_pandas().T.to_numpy()
    
    

    The intermediate table has 10 columns and 100 rows:

    |         0 |          1 |          2 |         3 |          4 |          5 |          6 |         7 |         8 |          9 |
    |----------:|-----------:|-----------:|----------:|-----------:|-----------:|-----------:|----------:|----------:|-----------:|
    | 0.45774   | 0.92753    | 0.252345   | 0.982261  | 0.503732   | 0.543526   | 0.22827    | 0.347948  | 0.654259  | 0.152693   |
    | 0.287813  | 0.793067   | 0.972282   | 0.739047  | 0.0689906  | 0.102235   | 0.110273   | 0.166839  | 0.907481  | 0.427729   |
    | 0.523928  | 0.511737   | 0.473887   | 0.771607  | 0.707633   | 0.276726   | 0.943073   | 0.788174  | 0.305119  | 0.511876   |
    | 0.67563   | 0.947449   | 0.895125   | 0.246979  | 0.703503   | 0.256418   | 0.93113    | 0.116715  | 0.330746  | 0.566704   |
    | 0.471526  | 0.45332    | 0.546384   | 0.822873  | 0.333542   | 0.518933   | 0.229525   | 0.381977  | 0.893204  | 0.932781   |
    ...
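
    If you would rather keep each row of the matrix together as a single value, another option is a table with one list-valued column, one list per row. This is a minimal sketch, assuming a reasonably recent pyarrow; the column name "data" and the file name are arbitrary:

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    matrix = np.random.rand(10, 100)
    
    # One table row per matrix row; each cell holds that row as a list of doubles.
    lists = pa.array(matrix.tolist())
    table = pa.table({"data": lists})
    pq.write_table(table, 'rows_as_lists.pq')
    
    # Read it back and rebuild the 2-D array:
    restored = pq.read_table('rows_as_lists.pq')
    matrix_back = np.array(restored["data"].to_pylist())

    Note that tolist() and to_pylist() go through Python objects, so for a 159573 x 1395 array this will be noticeably slower than building one Arrow array per row as above; the upside is that one table row corresponds to one sample, which can be more natural to feed to an ML pipeline.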