I have a numpy.ndarray that I would like to save to a Parquet file to pass to an ML model I'm building. The array has 159573 rows, each containing 1395 values.
Here is a sample of my data:
[[0. 0. 0. ... 0.24093714 0.75547471 0.74532781]
[0. 0. 0. ... 0.24093714 0.75547471 0.74532781]
[0. 0. 0. ... 0.24093714 0.75547471 0.74532781]
...
[0. 0. 0. ... 0.89473684 0.29282009 0.29277004]
[0. 0. 0. ... 0.89473684 0.29282009 0.29277004]
[0. 0. 0. ... 0.89473684 0.29282009 0.29277004]]
I tried to convert using this code:
import pyarrow as pa
pa_table = pa.table({"data": Main_x})
pa.parquet.write_table(pa_table, "full_data.parquet")
I get this stacktrace:
/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.table()
/usr/local/lib/python3.7/dist-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pydict()
/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.asarray()
/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib.array()
/usr/local/lib/python3.7/dist-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
/usr/local/lib/python3.7/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: only handle 1-dimensional arrays
I'm wondering if there is a way to save a multi-dimensional array to parquet?
Parquet/Arrow isn't well suited to this type of data. It's better at dealing with tabular data that has a well-defined schema with specific column names and types. In particular, the numpy conversion API only supports one-dimensional data.
That said, you can easily convert your 2-D numpy array to Parquet, but you need to massage it first.
Your best option is to save it as a table with n columns of m doubles each.
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

matrix = np.random.rand(10, 100)

arrays = [
    pa.array(row)  # one Arrow array (one table column) per matrix row
    for row in matrix
]
table = pa.Table.from_arrays(
    arrays,
    names=[str(i) for i in range(len(arrays))],  # give each column a name
)

# Save it:
pq.write_table(table, 'table.pq')

# Read it back as numpy (transpose to undo the row-to-column mapping):
table_from_parquet = pq.read_table('table.pq')
matrix_from_parquet = table_from_parquet.to_pandas().T.to_numpy()
The intermediate table has 10 columns and 100 rows:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----------:|-----------:|-----------:|----------:|-----------:|-----------:|-----------:|----------:|----------:|-----------:|
| 0.45774 | 0.92753 | 0.252345 | 0.982261 | 0.503732 | 0.543526 | 0.22827 | 0.347948 | 0.654259 | 0.152693 |
| 0.287813 | 0.793067 | 0.972282 | 0.739047 | 0.0689906 | 0.102235 | 0.110273 | 0.166839 | 0.907481 | 0.427729 |
| 0.523928 | 0.511737 | 0.473887 | 0.771607 | 0.707633 | 0.276726 | 0.943073 | 0.788174 | 0.305119 | 0.511876 |
| 0.67563 | 0.947449 | 0.895125 | 0.246979 | 0.703503 | 0.256418 | 0.93113 | 0.116715 | 0.330746 | 0.566704 |
| 0.471526 | 0.45332 | 0.546384 | 0.822873 | 0.333542 | 0.518933 | 0.229525 | 0.381977 | 0.893204 | 0.932781 |
...